The function splits a dataframe by rows into a sample of reference items, a sample of questioned items and a set of background items. The split is done according to the item source.

make_dataset_splits(df, k_ref, k_quest, col_source = "source", ...)

Arguments

df

all available data

k_ref

number of reference samples

k_quest

number of questioned samples

col_source

column containing the source identifier (string or column number)

...

Arguments passed on to make_idx_splits

source_ref

the reference source (scalar; if NULL, a random source will be picked)

source_quest

the questioned source(s) (if NULL, anything but the reference source: behaviour overridden by same_source)

same_source

if source_quest is NULL and same_source is TRUE, questioned source is the reference source; see details for conflict resolution

source_ref_allowed

allowed reference sources (default: NULL; if not NULL, the reference source will be picked among these)

source_quest_allowed

allowed questioned sources (default: NULL; if not NULL, the questioned source(s) will be picked among these)

background

see details (default: 'outside')

replace

use sampling with replacement if needed; otherwise raise an error

strict

fail at any incoherence between parameters instead of giving warnings or assuming (default: FALSE)

Value

a list of indexes (idx_reference, idx_questioned, idx_background) and a list of dataframes (df_reference, df_questioned, df_background)

Details

Reference and questioned samples are always non-intersecting, even when the source is the same.

Sampling with replacement is used, if necessary and not forbidden.
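
The rule above can be sketched in base R. This is only an illustration, not the package's implementation: the helper sample_items and its default replace = FALSE are assumptions made for the example.

```r
# Hypothetical helper: draw k items from a pool of row indexes.
# Sampling without replacement is preferred; replacement is used only
# when the pool is too small and `replace` allows it.
sample_items <- function(pool, k, replace = FALSE) {
  if (k <= length(pool)) {
    return(pool[sample.int(length(pool), k)])
  }
  if (!replace) {
    stop("not enough items: ", k, " requested, ", length(pool), " available")
  }
  pool[sample.int(length(pool), k, replace = TRUE)]
}

pool  <- which(iris$Species == "virginica")        # 50 candidate rows
small <- sample_items(pool, 5)                     # no replacement needed
large <- sample_items(pool, 500, replace = TRUE)   # replacement required
```

Requesting 500 items with replace = FALSE stops with an error in this sketch, because only 50 rows are available.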

Background selection

  • If background is 'outside', the background dataset comprises all items that lie in neither the reference set nor the questioned set. It can contain items from the reference and questioned sources.

  • If background is 'others', the background dataset comprises all items from the non-reference and non-questioned potential sources.

  • If background is 'unobserved', the background dataset comprises all items from the sources that do not appear in either the reference set or the questioned set.

By default, background is 'outside'.

Note that background = 'others' generates no background data if questioned sources are not specified: the reference and questioned sources together then cover all available sources in the population.
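
The three background rules can be compared with a small base R sketch. It does not call the package: the row indexes and the set of potential questioned sources below are fixed by hand, and reading "potential sources" as "sources that could have been sampled" is an assumption.

```r
# Assumed inputs: a source column and already-chosen reference/questioned rows.
src <- as.character(iris$Species)
idx_reference  <- 101:103    # rows from 'virginica' (the reference source)
idx_questioned <- 51:53      # rows from 'versicolor'
all_idx <- seq_along(src)
sampled <- c(idx_reference, idx_questioned)

# Questioned sources not specified: every source but the reference is potential
source_ref   <- "virginica"
source_quest <- setdiff(unique(src), source_ref)

# 'outside': every row that is in neither the reference nor the questioned set
bg_outside <- setdiff(all_idx, sampled)

# 'others': rows from sources that are neither reference nor potential questioned
bg_others <- all_idx[!(src %in% c(source_ref, source_quest))]

# 'unobserved': rows from sources that contributed no sampled item
bg_unobserved <- all_idx[!(src %in% unique(src[sampled]))]
```

In this setup bg_others is empty, since the reference and potential questioned sources cover all three species, while bg_unobserved keeps the rows of the species from which nothing was sampled.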

Source sampling

Sampling happens in steps:

  1. a reference source is picked

  2. questioned sources are picked

  3. items from the reference source are picked

  4. items from questioned sources are picked

  5. remaining items form the background set: apply background restrictions
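
The steps above can be sketched in base R as follows. This is a simplified illustration under default-like assumptions (random reference source, questioned sources being all the others, background 'outside'), not the code of make_idx_splits.

```r
set.seed(1)
src <- as.character(iris$Species)
k_ref   <- 5
k_quest <- 5

# 1. pick a reference source at random
source_ref <- sample(unique(src), 1)

# 2. pick questioned sources: here, every source but the reference one
source_quest <- setdiff(unique(src), source_ref)

# 3. pick items from the reference source
pool_ref <- which(src == source_ref)
idx_reference <- pool_ref[sample.int(length(pool_ref), k_ref)]

# 4. pick items from the questioned sources, never reusing a reference item
pool_quest <- setdiff(which(src %in% source_quest), idx_reference)
idx_questioned <- pool_quest[sample.int(length(pool_quest), k_quest)]

# 5. remaining items form the background set (the 'outside' rule)
idx_background <- setdiff(seq_along(src), c(idx_reference, idx_questioned))

# Row subsets matching the three index vectors
df_reference  <- iris[idx_reference, ]
df_questioned <- iris[idx_questioned, ]
df_background <- iris[idx_background, ]
```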

By design, the package identifies sources which are allowed to be sampled from. By default, all available sources can appear in the reference or questioned samples.

This restriction can be modified using the parameters source_ref_allowed and source_quest_allowed. It is also forbidden to specify a source_ref or a source_quest that is not allowed.

When specified, same_source takes precedence: it overrides source_quest_allowed.

If source_quest is NULL:

  • if same_source is NULL or FALSE, questioned items are sampled from all but the reference source (or source_quest_allowed).

  • if same_source is TRUE, questioned items are sampled from the reference source (source_quest_allowed is ignored).
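
A minimal sketch of the source_quest = NULL case, in base R. The helper pick_quest_sources is hypothetical, and combining "all but the reference source" with source_quest_allowed by intersection is an assumption about how the two restrictions interact.

```r
# Hypothetical helper, covering only the case source_quest = NULL.
pick_quest_sources <- function(sources, source_ref, same_source = NULL,
                               source_quest_allowed = NULL) {
  if (isTRUE(same_source)) {
    # same_source wins: the questioned source is the reference source
    return(source_ref)
  }
  # same_source is NULL or FALSE: everything but the reference source ...
  candidates <- setdiff(sources, source_ref)
  # ... optionally narrowed down to the allowed questioned sources (assumption)
  if (!is.null(source_quest_allowed)) {
    candidates <- intersect(candidates, source_quest_allowed)
  }
  candidates
}

lv <- levels(iris$Species)
quest_default <- pick_quest_sources(lv, "virginica")
quest_same    <- pick_quest_sources(lv, "virginica", same_source = TRUE,
                                    source_quest_allowed = "setosa")
quest_allowed <- pick_quest_sources(lv, "virginica",
                                    source_quest_allowed = c("setosa", "virginica"))
```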

If source_quest is not NULL:

  • same_source always has priority: if it is specified, source_quest is ignored and the chosen reference source is used as the questioned source.

  • if source_quest conflicts with same_source, an error is raised.

Otherwise, questioned items are sampled from the questioned source(s), even if these include the reference source.

Items will never be sampled more than once (unless replace is TRUE): each item appears at most once across the reference, questioned and background sets.

See also

Other set sampling functions: make_idx_splits()

Examples

if (FALSE) {
# Sample different species
make_dataset_splits(iris, 5, 5, col_source = 'Species')

# Sample same species
make_dataset_splits(iris, 5, 5, col_source = 'Species', same_source = TRUE)

# Sample from custom species
make_dataset_splits(iris, 5, 5, col_source = 'Species',
                    source_ref = 'virginica', source_quest = 'versicolor')
make_dataset_splits(iris, 5, 5, col_source = 'Species',
                    source_ref = 'virginica',
                    source_quest = c('virginica', 'versicolor'))

# Sample from reference source with replacement
make_dataset_splits(iris, 500, 5, col_source = 'Species', replace = TRUE)

# Use background sources from non-sampled items
make_dataset_splits(iris, 50, 50, col_source = 'Species',
                    source_ref = 'virginica', source_quest = 'versicolor',
                    background = 'others')
}