R/ref_quest_split.R
make_idx_splits.Rd
The function splits a list of items (rows) into a sample of reference items, questioned items and background items.
make_idx_splits( sources, k_ref, k_quest, source_ref = NULL, source_quest = NULL, source_ref_allowed = NULL, source_quest_allowed = NULL, same_source = NULL, background = "outside", replace = TRUE, strict = FALSE )
Argument | Description |
---|---|
sources | all class labels |
k_ref | number of reference samples |
k_quest | number of questioned samples |
source_ref | the reference source (scalar; if |
source_quest | the questioned source(s) (if |
source_ref_allowed | allowed reference sources (if not |
source_quest_allowed | allowed questioned sources (if not |
same_source | if |
background | see details (default: "outside") |
replace | use sampling with replacement, else error |
strict | fail at any incoherence between parameters instead of giving warnings or assuming (default: FALSE) |
A list of indexes (idx_reference, idx_questioned, idx_background).
Reference/questioned/background samples are always non-intersecting, even when the source is the same.
Sampling with replacement is used, if necessary and not forbidden. If it is used, a message appears.
It is always guaranteed that no sample appears in more than one of the reference/questioned/background sets (but it can appear multiple times within a set if replace is TRUE).
Sampling happens in steps:
1. a reference source is picked
2. questioned sources are picked
3. items from the reference source are picked
4. items from the questioned sources are picked
5. remaining items form the background set: background restrictions are applied
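The steps above can be sketched in base R as follows. This is a minimal illustration, not the package's implementation: `sketch_split` is a hypothetical helper, it assumes the default behaviour described on this page (questioned sources are all sources except the reference one, background is 'outside'), and it samples without replacement only.

```r
# Hypothetical helper: illustrates the five sampling steps with base R only.
sketch_split <- function(sources, k_ref, k_quest) {
  sources <- as.character(sources)
  idx <- seq_along(sources)
  all_sources <- unique(sources)

  # Step 1: pick a reference source
  source_ref <- all_sources[sample.int(length(all_sources), 1)]

  # Step 2: pick questioned sources (assumed: all but the reference source)
  source_quest <- setdiff(all_sources, source_ref)

  # Step 3: pick items from the reference source (errors if too few items)
  pool_ref <- idx[sources == source_ref]
  idx_reference <- pool_ref[sample.int(length(pool_ref), k_ref)]

  # Step 4: pick items from the questioned sources
  pool_quest <- idx[sources %in% source_quest]
  idx_questioned <- pool_quest[sample.int(length(pool_quest), k_quest)]

  # Step 5: remaining items form the background set ('outside' behaviour)
  idx_background <- setdiff(idx, c(idx_reference, idx_questioned))

  list(
    idx_reference = idx_reference,
    idx_questioned = idx_questioned,
    idx_background = idx_background
  )
}

set.seed(1)
sources <- rep(c("a", "b", "c"), each = 5)
s <- sketch_split(sources, k_ref = 2, k_quest = 3)
```

Because the questioned pool excludes the reference source and the background is built with `setdiff()`, the three index sets in this sketch cannot overlap.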
By design, the package identifies sources which are allowed to be sampled from. By default, all available sources can appear in the reference or questioned samples.
This restriction can be modified using the parameters source_ref_allowed and source_quest_allowed.
It is also forbidden to specify a source_ref or a source_quest which is not allowed.
The behaviour of same_source, when specified, is stronger, and source_quest_allowed is overridden.
If source_quest is NULL:
- if same_source is NULL or FALSE, questioned items are sampled from all but the reference source (or source_quest_allowed).
- if same_source is TRUE, questioned items are sampled from the reference source (source_ref_allowed is ignored).
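The two cases for a NULL source_quest can be illustrated with a small base R sketch. `resolve_quest_sources` is a hypothetical helper, not a function of the package, and it assumes that when source_quest_allowed is given the reference source is still excluded from it unless same_source is TRUE.

```r
# Hypothetical helper: which sources may supply questioned items
# when source_quest is NULL.
resolve_quest_sources <- function(all_sources, source_ref,
                                  same_source = NULL,
                                  source_quest_allowed = NULL) {
  # same_source = TRUE: questioned items come from the reference source
  if (isTRUE(same_source)) {
    return(source_ref)
  }
  # Otherwise: all sources (or the allowed ones), minus the reference source
  candidates <- if (is.null(source_quest_allowed)) {
    all_sources
  } else {
    source_quest_allowed
  }
  setdiff(candidates, source_ref)
}

all_sources <- c("a", "b", "c")
q_default <- resolve_quest_sources(all_sources, "a")
q_same <- resolve_quest_sources(all_sources, "a", same_source = TRUE)
q_allowed <- resolve_quest_sources(all_sources, "a",
                                   source_quest_allowed = c("a", "b"))
```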
If source_quest is not NULL:
- same_source always has priority: if specified, source_quest will be ignored and the chosen reference source will be picked.
- if source_quest conflicts with same_source, an error is raised.
- Else, questioned items will be sampled from the questioned source(s), even if they include the reference one.
Items will never be sampled more than once (unless replace is TRUE): they appear at most once across the reference/questioned/background items.
If background is 'outside', the background dataset comprises all items that do not lie in either the reference or the questioned set.
It can contain items from reference and questioned sources.
If background is 'others', the background dataset comprises all items from the non-reference and non-questioned potential sources.
If background is 'unobserved', the background dataset comprises all items from the sources that do not appear in either the reference or the questioned set.
By default, background is 'outside'.
Note that background = 'others' generates no background data if questioned sources are not specified: the union of reference and questioned sources then covers all available sources in the population.
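The three background modes can be compared with a short base R sketch. `sketch_background` is a hypothetical helper and only one reading of the descriptions above: 'others' is taken to exclude every potential reference or questioned source, while 'unobserved' excludes only the sources that actually contributed sampled items.

```r
# Hypothetical helper: background indexes under each mode.
sketch_background <- function(sources, idx_reference, idx_questioned,
                              source_ref, source_quest,
                              background = c("outside", "others", "unobserved")) {
  background <- match.arg(background)
  idx <- seq_along(sources)
  idx_used <- c(idx_reference, idx_questioned)
  switch(background,
    # all items not already in the reference or questioned sets
    outside = setdiff(idx, idx_used),
    # items from sources that are neither reference nor questioned candidates
    others = idx[!(sources %in% c(source_ref, source_quest))],
    # items from sources with no sampled reference or questioned item
    unobserved = idx[!(sources %in% unique(sources[idx_used]))]
  )
}

sources <- rep(c("a", "b", "c"), each = 4)
idx_reference <- c(1L, 2L)   # two items from source "a"
idx_questioned <- c(5L, 6L)  # two items, both happen to come from source "b"
source_ref <- "a"
source_quest <- c("b", "c")  # questioned sources unspecified: all but "a"

bg_outside <- sketch_background(sources, idx_reference, idx_questioned,
                                source_ref, source_quest, "outside")
bg_others <- sketch_background(sources, idx_reference, idx_questioned,
                               source_ref, source_quest, "others")
bg_unobserved <- sketch_background(sources, idx_reference, idx_questioned,
                                   source_ref, source_quest, "unobserved")
```

In this example 'outside' keeps the eight unused items, 'others' is empty because the reference and questioned sources together cover every source, and 'unobserved' keeps only the items from source "c".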
Other set sampling functions:
make_dataset_splits()