R/ref_quest_split.R
make_dataset_splits.Rd
The function splits a dataframe by rows into a sample of reference items, a sample of questioned items and a set of background items. The split is made according to the item source.
make_dataset_splits(df, k_ref, k_quest, col_source = "source", ...)
Argument | Description |
---|---|
df | all available data |
k_ref | number of reference samples |
k_quest | number of questioned samples |
col_source | column containing the source identifier (string or column number) |
... | Arguments passed on to |
A list of indexes (idx_reference, idx_questioned, idx_background) and a list of dataframes (df_reference, df_questioned, df_background).
Reference and questioned samples are always non-intersecting, even when the source is the same.
Sampling with replacement is used, if necessary and not forbidden.
If background is 'outside', the background dataset comprises all items that do not lie in any of the reference and questioned sets. It can contain items from reference and questioned sources.
If background is 'others', the background dataset comprises all items from the non-reference and non-questioned potential sources.
If background is 'unobserved', the background dataset comprises all items from the sources that do not appear in any of the reference and questioned sets.
By default, background is 'outside'.
Note that background = 'others' generates no background data if questioned sources are not specified: the union of reference and questioned sources then covers all available sources in the population.
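The three background modes can be contrasted with a small base R sketch on iris. This is one reading of the definitions above, not the package's code: the variable names (`potential`, `observed`, `bg_*`) are hypothetical, and the scenario (reference source virginica, potential questioned sources versicolor and setosa, a single questioned item) is chosen only to make the modes differ.

```r
set.seed(1)
sources <- as.character(iris$Species)

# Assumed scenario: one reference source, two potential questioned sources.
source_ref   <- "virginica"
source_quest <- c("versicolor", "setosa")

idx_ref   <- sample(which(sources == source_ref), 5)
idx_quest <- sample(which(sources %in% source_quest), 1)
idx_used  <- c(idx_ref, idx_quest)

# 'outside': every row that was not drawn, whatever its source.
bg_outside <- setdiff(seq_along(sources), idx_used)

# 'others': rows whose source is neither a potential reference
# nor a potential questioned source.
potential <- union(source_ref, source_quest)
bg_others <- which(!(sources %in% potential))

# 'unobserved': rows whose source does not occur among the drawn items.
observed      <- unique(sources[idx_used])
bg_unobserved <- which(!(sources %in% observed))

lengths(list(outside = bg_outside, others = bg_others, unobserved = bg_unobserved))
```

Because the potential sources cover every species here, the 'others' set is empty, while 'unobserved' still keeps the species that happened not to be drawn.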
Sampling happens in steps:
1. a reference source is picked
2. questioned sources are picked
3. items from reference sources are picked
4. items from questioned sources are picked
5. the remaining items form the background set, subject to the background restrictions
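The steps above can be sketched in base R as follows. This is an illustration on iris under simplifying assumptions, not the package's implementation: questioned sources are taken to be all sources except the reference one, and the background step assumes the 'outside' behaviour. Variable names merely echo the documented arguments and return values.

```r
set.seed(42)
df  <- iris
src <- as.character(df$Species)
k_ref   <- 5
k_quest <- 5

# 1. Pick a reference source.
source_ref <- sample(unique(src), 1)

# 2. Pick questioned sources (assumed here: every other source).
source_quest <- setdiff(unique(src), source_ref)

# 3. Pick reference items from the reference source.
idx_reference <- sample(which(src == source_ref), k_ref)

# 4. Pick questioned items, never reusing a reference item.
pool_quest     <- setdiff(which(src %in% source_quest), idx_reference)
idx_questioned <- sample(pool_quest, k_quest)

# 5. Remaining items form the background set ('outside' behaviour).
idx_background <- setdiff(seq_len(nrow(df)), c(idx_reference, idx_questioned))

df_reference  <- df[idx_reference, ]
df_questioned <- df[idx_questioned, ]
df_background <- df[idx_background, ]
```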
By design, the package identifies sources which are allowed to be sampled from. By default, all available sources can appear in the reference or questioned samples.
This restriction can be modified using the parameters source_ref_allowed and source_quest_allowed.
It is also forbidden to specify a source_ref or a source_quest that is not allowed.
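The check on allowed sources might look like the following sketch. `check_source_allowed()` is a hypothetical helper written for illustration; how the package actually validates source_ref and source_quest, and what error message it raises, is not stated here.

```r
# Hypothetical helper (not a package function): stop if a requested source
# is not among the allowed ones.
check_source_allowed <- function(source, allowed, what = "source") {
  if (is.null(source)) {
    return(invisible(TRUE))  # nothing requested, nothing to check
  }
  bad <- setdiff(source, allowed)
  if (length(bad) > 0) {
    stop(what, " not allowed: ", paste(bad, collapse = ", "), call. = FALSE)
  }
  invisible(TRUE)
}

source_ref_allowed <- c("setosa", "versicolor")

# An allowed source passes silently.
check_source_allowed("setosa", source_ref_allowed, "source_ref")

# A source outside the allowed set raises an error.
res <- try(
  check_source_allowed("virginica", source_ref_allowed, "source_ref"),
  silent = TRUE
)
inherits(res, "try-error")
```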
The behaviour of same_source, when specified, takes precedence: source_quest_allowed is overridden.
If source_quest is NULL:
- if same_source is NULL or FALSE, questioned items are sampled from all but the reference source (or source_quest_allowed).
- if same_source is TRUE, questioned items are sampled from the reference source (source_ref_allowed is ignored).
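For the case where source_quest is NULL, the choice of candidate questioned sources can be sketched as below. `quest_source_candidates()` is a hypothetical helper, and combining "all but the reference source" with source_quest_allowed by intersection is an assumption about how the two restrictions interact.

```r
# Hypothetical helper: candidate questioned sources when source_quest is NULL.
quest_source_candidates <- function(source_ref, all_sources,
                                    same_source = NULL,
                                    source_quest_allowed = all_sources) {
  if (isTRUE(same_source)) {
    # Same-source comparison: questioned items come from the reference source.
    return(source_ref)
  }
  # same_source is NULL or FALSE: every allowed source except the reference one.
  intersect(setdiff(all_sources, source_ref), source_quest_allowed)
}

all_sources <- levels(iris$Species)

different <- quest_source_candidates("virginica", all_sources)
same      <- quest_source_candidates("virginica", all_sources, same_source = TRUE)
limited   <- quest_source_candidates("virginica", all_sources,
                                     source_quest_allowed = "setosa")
```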
If source_quest is not NULL:
- same_source always has priority: if specified, source_quest will be ignored and the chosen reference source will be picked.
- if source_quest conflicts with same_source, an error is raised.
- otherwise, questioned items will be sampled from the questioned source(s), even if they include the reference one.
Items will never be sampled more than once (unless replace is TRUE): each item appears at most once across the reference, questioned and background items.
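The "sample once unless replacement is needed and allowed" rule can be illustrated with a short sketch. `draw_items()` is a hypothetical helper; it assumes that replacement is only switched on when more items are requested than are available, which is one reading of the text above rather than confirmed package behaviour.

```r
# Hypothetical helper: draw k items from a pool of row indexes.
draw_items <- function(idx, k, replace = FALSE) {
  need_replace <- k > length(idx)
  if (need_replace && !replace) {
    stop("not enough items to sample ", k, " without replacement", call. = FALSE)
  }
  # sample.int() on positions avoids sample()'s special case for length-1 input.
  idx[sample.int(length(idx), k, replace = need_replace)]
}

set.seed(1)
virginica <- which(iris$Species == "virginica")  # 50 rows

small <- draw_items(virginica, 5)                    # no item repeated
large <- draw_items(virginica, 500, replace = TRUE)  # repeats are unavoidable
too_many <- try(draw_items(virginica, 500), silent = TRUE)  # replacement forbidden
```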
Other set sampling functions: make_idx_splits()
if (FALSE) {
# Sample different species
make_dataset_splits(iris, 5, 5, col_source = 'Species')

# Sample same species
make_dataset_splits(iris, 5, 5, col_source = 'Species', same_source = TRUE)

# Sample from custom species
make_dataset_splits(iris, 5, 5, col_source = 'Species',
                    source_ref = 'virginica', source_quest = 'versicolor')
make_dataset_splits(iris, 5, 5, col_source = 'Species',
                    source_ref = 'virginica', source_quest = c('virginica', 'versicolor'))

# Sample from reference source with replacement
make_dataset_splits(iris, 500, 5, col_source = 'Species', replace = TRUE)

# Use background sources from non-sampled items
make_dataset_splits(iris, 50, 50, col_source = 'Species',
                    source_ref = 'virginica', source_quest = 'versicolor',
                    background = 'others')
}