Generate reference/questioned/background observations from a multiple-source dataset.

The function splits a data frame by rows into a sample of reference items, a sample of questioned items and a set of background items. The split is done according to the item source.

make_dataset_splits(df, k_ref, k_quest, col_source = "source", ...)

Arguments

df	all available data
k_ref	number of reference samples
k_quest	number of questioned samples
col_source	column containing the source identifier (string or column number)
...	Arguments passed on to `make_idx_splits`:
  `source_ref`	the reference source (scalar; if `NULL`, a random source will be picked)
  `source_quest`	the questioned source(s) (if `NULL`, anything but the reference source: behaviour overridden by `same_source`)
  `same_source`	if `source_quest` is `NULL` and `same_source` is `TRUE`, questioned source is the reference source; see details for conflict resolution
  `source_ref_allowed`	allowed reference sources (if not `NULL` (default), reference source will be picked among these)
  `source_quest_allowed`	allowed questioned sources (if not `NULL` (default), questioned source(s) will be picked among these)
  `background`	see details (default: `'outside'`)
  `replace`	use sampling with replacement, else error
  `strict`	fail at any incoherence between parameters instead of giving warnings or assuming (default: `FALSE`)

Value

a list of indexes (idx_reference, idx_questioned, idx_background) and a list of dataframes (df_reference, df_questioned, df_background)

Details

Reference and questioned samples are always non-intersecting, even when the source is the same.

Sampling with replacement is used, if necessary and not forbidden.

Background selection

If background is 'outside', the background dataset comprises all items that do not lie in any of the reference and questioned sets. It can contain items from reference and questioned sources.
If background is 'others', the background dataset comprises all items from the non-reference and non-questioned potential sources.
If background is 'unobserved', the background dataset comprises all items from the sources that do not appear in any of the reference and questioned sets.

By default, background is 'outside'.

Notice that background = 'others' generates no background data if questioned sources are not specified: the union of reference and questioned sources fills the available sources in the population.
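
The three background modes can be illustrated with a minimal base R sketch. It is not the package's implementation: the variable names (potential_quest, bg_outside, bg_others, bg_unobserved) are hypothetical, and hand-picked row indices stand in for a random draw so that the counts are reproducible. The questioned sources are left unspecified, so every non-reference source is a potential questioned source.

```r
# Illustrative sketch only; not how make_dataset_splits is implemented.
sources <- as.character(iris$Species)
all_idx <- seq_len(nrow(iris))

# Reference source, and potential questioned sources when none is specified
source_ref      <- "virginica"
potential_quest <- setdiff(unique(sources), source_ref)

# Hand-picked items standing in for the random draw:
# the questioned items happen to come from 'versicolor' only
idx_reference  <- which(sources == source_ref)[1:5]
idx_questioned <- which(sources == "versicolor")[1:5]
idx_sampled    <- c(idx_reference, idx_questioned)

# 'outside': every item not in the reference or questioned sets
bg_outside <- setdiff(all_idx, idx_sampled)

# 'others': items from sources that are neither reference nor potential questioned sources
bg_others <- which(!(sources %in% c(source_ref, potential_quest)))

# 'unobserved': items from sources that do not appear in the sampled sets
bg_unobserved <- which(!(sources %in% unique(sources[idx_sampled])))

length(bg_outside)     # 140
length(bg_others)      # 0
length(bg_unobserved)  # 50
```

In this sketch 'others' is empty because the reference and potential questioned sources together cover all three species, while 'unobserved' keeps the 50 setosa rows, since no setosa item was actually drawn.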

Source sampling

Sampling happens in steps:

a reference source is picked
questioned sources are picked
items from reference sources are picked
items from questioned sources are picked
remaining items form the background set: apply background restrictions
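
The steps above can be sketched in base R as follows. This is a simplified illustration under stated assumptions, not the package's code: it assumes the default case (random reference source, questioned sources being all other sources, sampling without replacement, background = 'outside'), and all variable names are hypothetical.

```r
# Illustrative sketch of the sampling steps; not the package's implementation.
set.seed(42)
sources <- as.character(iris$Species)
k_ref   <- 5
k_quest <- 5

# 1. Pick a reference source at random
source_ref <- sample(unique(sources), 1)

# 2. Pick the questioned sources (assumed here: all but the reference source)
source_quest <- setdiff(unique(sources), source_ref)

# 3. Pick reference items from the reference source
pool_ref      <- which(sources == source_ref)
idx_reference <- pool_ref[sample.int(length(pool_ref), k_ref)]

# 4. Pick questioned items from the questioned sources,
#    excluding anything already used as a reference item
pool_quest     <- setdiff(which(sources %in% source_quest), idx_reference)
idx_questioned <- pool_quest[sample.int(length(pool_quest), k_quest)]

# 5. Remaining items form the background set ('outside' restriction)
idx_background <- setdiff(seq_along(sources), c(idx_reference, idx_questioned))
```

Indexing the pools with sample.int avoids the base R pitfall where sample(x, n) treats a length-one numeric x as the range 1:x.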

By design, the package identifies sources which are allowed to be sampled from. By default, all available sources can appear in the reference or questioned samples.

The set of allowed sources can be restricted using the parameters source_ref_allowed and source_quest_allowed. It is also forbidden to specify a source_ref or a source_quest that is not allowed.

The behaviour of same_source, when specified, is stronger, and source_quest_allowed is overridden.

If source_quest is NULL:

if same_source is NULL or FALSE, questioned items are sampled from all but the reference source (or source_quest_allowed).
if same_source is TRUE, questioned items are sampled from the reference source (source_quest_allowed is ignored).
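
The case where source_quest is NULL can be sketched with a small helper. The function resolve_quest_sources is hypothetical and is not part of the package. It follows the rule stated above that same_source overrides source_quest_allowed, and it assumes that a non-NULL source_quest_allowed is used as given.

```r
# Hypothetical helper, not part of the package: resolves the questioned
# source(s) when source_quest is NULL.
resolve_quest_sources <- function(all_sources, source_ref,
                                  same_source = NULL,
                                  source_quest_allowed = NULL) {
  if (isTRUE(same_source)) {
    # same_source = TRUE: the questioned source is the reference source,
    # and source_quest_allowed is overridden
    return(source_ref)
  }
  if (!is.null(source_quest_allowed)) {
    # Assumption: the allowed questioned sources are used as given
    return(source_quest_allowed)
  }
  # same_source is NULL or FALSE: all but the reference source
  setdiff(all_sources, source_ref)
}

species <- levels(iris$Species)
resolve_quest_sources(species, "virginica")
# "setosa" "versicolor"
resolve_quest_sources(species, "virginica", same_source = TRUE)
# "virginica"
resolve_quest_sources(species, "virginica", source_quest_allowed = "setosa")
# "setosa"
```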

If source_quest is not NULL:

same_source always has priority: if specified, source_quest will be ignored and the chosen reference source will be used as the questioned source.
if source_quest conflicts with same_source, an error is raised.

Else, questioned items will be sampled from the questioned source(s), even if it contains the reference one.

Items will never be sampled more than once (unless replace is TRUE): each item appears at most once across the reference/questioned/background items.

Examples

if (FALSE) {
# Sample different species
make_dataset_splits(iris, 5, 5, col_source = 'Species')

# Sample same species
make_dataset_splits(iris, 5, 5, col_source = 'Species', same_source = TRUE)

# Sample from custom species
make_dataset_splits(iris, 5, 5, col_source = 'Species',
   source_ref = 'virginica', source_quest = 'versicolor')
make_dataset_splits(iris, 5, 5, col_source = 'Species',
   source_ref = 'virginica', source_quest = c('virginica', 'versicolor'))

# Sample from reference source with replacement
make_dataset_splits(iris, 500, 5, col_source = 'Species', replace = TRUE)

# Use items from the non-reference, non-questioned sources as background
make_dataset_splits(iris, 50, 50, col_source = 'Species',
   source_ref = 'virginica', source_quest = 'versicolor', background = 'others')
}