This vignette introduces the concepts behind the comparison of two sets of items, as implemented by the functions make_dataset_splits and make_idx_splits.

Source data

Suppose that we have a set of observations, each with an associated source.
The \(j\)-th observation from the \(i\)-th source is denoted \(X_{ij}\). In total, we have \(m\) sources, with \(n_i\) observations from the \(i\)-th source.

The general model for the data is:

\[X_{ij} \sim f(\theta_i) \quad \text{iid}, \quad j = 1, \ldots, n_i \]

where \(\theta_i\) are the parameters that uniquely characterise the \(i\)-th source.
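
As a minimal sketch of this model, the following simulates observations from a few sources. The choice of a Normal distribution for \(f\), the interpretation of \(\theta_i\) as a source mean, and all numeric values are assumptions made only for illustration; they are not part of the package.

```r
# Hypothetical illustration: f is taken to be a Normal distribution,
# and theta_i is the mean of the i-th source (both are assumptions).
set.seed(1)

m <- 3                      # number of sources
n <- c(4, 6, 5)             # n_i: number of observations per source
theta <- c(150, 250, 350)   # theta_i: one parameter per source

# Draw X_ij ~ Normal(theta_i, sd = 30), independently within each source
sim <- data.frame(
  source = factor(rep(seq_len(m), times = n)),
  x      = rnorm(sum(n), mean = rep(theta, times = n), sd = 30)
)

# One row per observation, labelled with its source
table(sim$source)
```

The resulting data frame has the same layout as the data used in the rest of this vignette: one column with the measurement and one column with the source label.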

For the purpose of illustration, let’s use the popular dataset chickwts:

data(chickwts)
head(chickwts)
#>   weight      feed
#> 1    179 horsebean
#> 2    160 horsebean
#> 3    136 horsebean
#> 4    227 horsebean
#> 5    217 horsebean
#> 6    168 horsebean

The data are the recorded weights of 71 chicks fed with 6 different feed supplements (feed column). We consider feed as the source label: we assume that chicks are exchangeable given the administered feed.

Two-sample concept

This package is used in a forensic evaluative setting, where one has a set of observations from a known source (the reference set and the reference source), and a set of observations whose source(s) is (are) unknown (the questioned set).

One commonly states two competing hypotheses, e.g. whether the source of the questioned set is the reference source, or one (or more) different sources.
These hypotheses are usually named \(H_1\) and \(H_2\). Notice that \(H_2\) can consider a single alternative source, or multiple sources: all of them are unknown and different from the reference one.
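
In the notation of the model above, the two hypotheses could be written as follows. The index \(r\) for the reference source, the symbols \(Y_1, \ldots, Y_k\) for the questioned observations, and \(s_j\) for the unknown source of the \(j\)-th questioned observation are notation assumed here for illustration only:

\[H_1: Y_j \sim f(\theta_r) \quad \text{for all } j = 1, \ldots, k\]

\[H_2: Y_j \sim f(\theta_{s_j}), \quad s_j \neq r, \quad \text{for all } j = 1, \ldots, k\]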

This package assists with the generation of the reference and questioned set starting from the observed data.frame.

The dimensions of the reference and questioned sets can be specified, and are usually much smaller than the full data.
It follows that (a subset of) the rows which have not been picked constitute a third set, the background set.

This set is used in a Bayesian setting to learn the (hyper)priors for the statistical model.
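
The three sets can be sketched by hand in base R. This is only an illustration of the concept, assuming a questioned set drawn from sources other than the reference one; it is not the package's implementation, and the variable names are hypothetical.

```r
data(chickwts)
set.seed(1)

k_ref <- 5                  # size of the reference set
k_quest <- 5                # size of the questioned set
source_ref <- "horsebean"   # reference source

idx_all <- seq_len(nrow(chickwts))

# Reference set: items from the reference source only
idx_pool_ref <- which(chickwts$feed == source_ref)
idx_ref <- sort(sample(idx_pool_ref, k_ref))

# Questioned set: here, items from any other source
idx_pool_quest <- which(chickwts$feed != source_ref)
idx_quest <- sort(sample(idx_pool_quest, k_quest))

# Background set: every row that has not been picked
idx_bg <- setdiff(idx_all, c(idx_ref, idx_quest))

lengths(list(reference = idx_ref, questioned = idx_quest, background = idx_bg))
```

Since the two pools do not overlap, no row can end up in more than one set.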

Sample generation

To connect with the example, we consider horsebean feed as the reference source.
From the full data, we pick 5 observations for each of the reference and questioned sets. Let’s see how rsamplestudy does it:

set.seed(123)

library(rsamplestudy)

n_items <- 5
col_source <- 'feed'

list_split <- make_dataset_splits(chickwts, k_ref = n_items, k_quest = n_items, col_source = 'feed', source_ref = 'horsebean')
list_split
#> $idx_reference
#> [1]  2  3  6  8 10
#> 
#> $idx_questioned
#> [1] 24 47 53 62 64
#> 
#> $idx_background
#>  [1]  1  4  5  7  9 11 12 13 14 15 16 17 18 19 20 21 22 23 25 26 27 28 29 30 31
#> [26] 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 48 49 50 51 52 54 55 56 57 58
#> [51] 59 60 61 63 65 66 67 68 69 70 71
#> 
#> $df_reference
#>    weight      feed
#> 2     160 horsebean
#> 3     136 horsebean
#> 6     168 horsebean
#> 8     124 horsebean
#> 10    140 horsebean
#> 
#> $df_questioned
#>    weight      feed
#> 24    230   soybean
#> 47    297 sunflower
#> 53    380  meatmeal
#> 62    379    casein
#> 64    404    casein
#> 
#> $df_background
#>    weight      feed
#> 1     179 horsebean
#> 4     227 horsebean
#> 5     217 horsebean
#> 7     108 horsebean
#> 9     143 horsebean
#> 11    309   linseed
#> 12    229   linseed
#> 13    181   linseed
#> 14    141   linseed
#> 15    260   linseed
#> 16    203   linseed
#> 17    148   linseed
#> 18    169   linseed
#> 19    213   linseed
#> 20    257   linseed
#> 21    244   linseed
#> 22    271   linseed
#> 23    243   soybean
#> 25    248   soybean
#> 26    327   soybean
#> 27    329   soybean
#> 28    250   soybean
#> 29    193   soybean
#> 30    271   soybean
#> 31    316   soybean
#> 32    267   soybean
#> 33    199   soybean
#> 34    171   soybean
#> 35    158   soybean
#> 36    248   soybean
#> 37    423 sunflower
#> 38    340 sunflower
#> 39    392 sunflower
#> 40    339 sunflower
#> 41    341 sunflower
#> 42    226 sunflower
#> 43    320 sunflower
#> 44    295 sunflower
#> 45    334 sunflower
#> 46    322 sunflower
#> 48    318 sunflower
#> 49    325  meatmeal
#> 50    257  meatmeal
#> 51    303  meatmeal
#> 52    315  meatmeal
#> 54    153  meatmeal
#> 55    263  meatmeal
#> 56    242  meatmeal
#> 57    206  meatmeal
#> 58    344  meatmeal
#> 59    258  meatmeal
#> 60    368    casein
#> 61    390    casein
#> 63    260    casein
#> 65    318    casein
#> 66    352    casein
#> 67    359    casein
#> 68    216    casein
#> 69    222    casein
#> 70    283    casein
#> 71    332    casein

The return value is a list with six elements: three vectors with the indexes of the rows in chickwts, and three data frames (i.e. chickwts split up according to those indexes).
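
The relation between the indexes and the data frames can be checked by hand. The sketch below copies the reference indexes from the output printed above and subsets chickwts directly, without calling the package.

```r
data(chickwts)

# Row indexes copied from the $idx_reference output printed above
idx_reference <- c(2, 3, 6, 8, 10)

# Subsetting the original data frame by these indexes gives the same rows
# as the printed $df_reference
df_reference <- chickwts[idx_reference, ]
df_reference
```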

Questioned source choice

Notice that the function automatically picked the questioned sources.
This behaviour is described in the make_dataset_splits documentation: if the questioned source is not specified, every source different from the reference one is considered a potential questioned source. Items are then picked at random from these sources.

We can also force the questioned source(s). For example, we can pick items from 'horsebean' and 'soybean':

list_split <- make_dataset_splits(chickwts, k_ref = n_items, k_quest = n_items, col_source = 'feed', 
                                  source_ref = 'horsebean', source_quest = c('horsebean', 'soybean'))
list_split
#> $idx_reference
#> [1] 1 3 5 8 9
#> 
#> $idx_questioned
#> [1]  6 24 25 26 27
#> 
#> $idx_background
#>  [1]  2  4  7 10 11 12 13 14 15 16 17 18 19 20 21 22 23 28 29 30 31 32 33 34 35
#> [26] 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
#> [51] 61 62 63 64 65 66 67 68 69 70 71
#> 
#> $df_reference
#>   weight      feed
#> 1    179 horsebean
#> 3    136 horsebean
#> 5    217 horsebean
#> 8    124 horsebean
#> 9    143 horsebean
#> 
#> $df_questioned
#>    weight      feed
#> 6     168 horsebean
#> 24    230   soybean
#> 25    248   soybean
#> 26    327   soybean
#> 27    329   soybean
#> 
#> $df_background
#>    weight      feed
#> 2     160 horsebean
#> 4     227 horsebean
#> 7     108 horsebean
#> 10    140 horsebean
#> 11    309   linseed
#> 12    229   linseed
#> 13    181   linseed
#> 14    141   linseed
#> 15    260   linseed
#> 16    203   linseed
#> 17    148   linseed
#> 18    169   linseed
#> 19    213   linseed
#> 20    257   linseed
#> 21    244   linseed
#> 22    271   linseed
#> 23    243   soybean
#> 28    250   soybean
#> 29    193   soybean
#> 30    271   soybean
#> 31    316   soybean
#> 32    267   soybean
#> 33    199   soybean
#> 34    171   soybean
#> 35    158   soybean
#> 36    248   soybean
#> 37    423 sunflower
#> 38    340 sunflower
#> 39    392 sunflower
#> 40    339 sunflower
#> 41    341 sunflower
#> 42    226 sunflower
#> 43    320 sunflower
#> 44    295 sunflower
#> 45    334 sunflower
#> 46    322 sunflower
#> 47    297 sunflower
#> 48    318 sunflower
#> 49    325  meatmeal
#> 50    257  meatmeal
#> 51    303  meatmeal
#> 52    315  meatmeal
#> 53    380  meatmeal
#> 54    153  meatmeal
#> 55    263  meatmeal
#> 56    242  meatmeal
#> 57    206  meatmeal
#> 58    344  meatmeal
#> 59    258  meatmeal
#> 60    368    casein
#> 61    390    casein
#> 62    379    casein
#> 63    260    casein
#> 64    404    casein
#> 65    318    casein
#> 66    352    casein
#> 67    359    casein
#> 68    216    casein
#> 69    222    casein
#> 70    283    casein
#> 71    332    casein
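
To see which sources ended up in the questioned set, the questioned indexes printed above can be tabulated by feed. This sketch uses base R only and copies the indexes from the output.

```r
data(chickwts)

# Row indexes copied from the $idx_questioned output printed above
idx_questioned <- c(6, 24, 25, 26, 27)

# Count the questioned items per source, dropping unused factor levels
counts <- table(droplevels(chickwts$feed[idx_questioned]))
counts
```

Here the questioned set mixes one item from the reference source with four items from a different source.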

Sampling with repetition

It is always guaranteed that no item appears in more than one set.
Within a single set, however, an item may be picked multiple times if this is explicitly allowed (it will still never appear in the other sets).

For example, consider the number of chicks per feed:

table(chickwts$feed)
#> 
#>    casein horsebean   linseed  meatmeal   soybean sunflower 
#>        12        10        12        11        14        12

We might want to obtain bootstrap samples from the chicks that have been fed horsebean, by sampling 100 items. Since only 10 such chicks are available, sampling with replacement is required.

n_items_rep <- 100
list_split <- make_dataset_splits(chickwts, 
                    k_ref = n_items_rep, 
                    k_quest = n_items_rep, col_source = 'feed', source_ref = 'horsebean')
#> Reference items: sampling with replacement is being used.
#> Questioned items: sampling with replacement is being used.

head(list_split$df_reference)
#>     weight      feed
#> 1      179 horsebean
#> 1.1    179 horsebean
#> 1.2    179 horsebean
#> 1.3    179 horsebean
#> 1.4    179 horsebean
#> 1.5    179 horsebean
head(list_split$df_questioned)
#>    weight    feed
#> 11    309 linseed
#> 12    229 linseed
#> 13    181 linseed
#> 14    141 linseed
#> 16    203 linseed
#> 17    148 linseed
head(list_split$df_background)
#>    weight      feed
#> 15    260   linseed
#> 22    271   linseed
#> 25    248   soybean
#> 27    329   soybean
#> 29    193   soybean
#> 37    423 sunflower

Items are never picked more than once across the sets:

intersect(list_split$idx_reference, list_split$idx_questioned)
#> integer(0)
intersect(list_split$idx_questioned, list_split$idx_background)
#> integer(0)
intersect(list_split$idx_reference, list_split$idx_background)
#> integer(0)
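
Why replacement is needed here can be seen with base R's sample(). The sketch below is independent of the package: it draws 100 indexes from the 10 horsebean rows, which is only possible with replace = TRUE.

```r
data(chickwts)
set.seed(1)

# Only 10 chicks were fed horsebean
idx_pool <- which(chickwts$feed == "horsebean")
length(idx_pool)

# Drawing 100 items without replacement is impossible: sample() raises an error
res <- tryCatch(sample(idx_pool, 100), error = function(e) "error")
res

# With replacement, the same rows are drawn repeatedly
idx_boot <- sample(idx_pool, 100, replace = TRUE)
all(idx_boot %in% idx_pool)   # every index is still a horsebean row
anyDuplicated(idx_boot) > 0   # some rows necessarily appear more than once
```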

Background selection

The background set can be automatically constituted in three ways, depending on whether the potential sampled sources (i.e. sources whose items could appear either in the reference or in the questioned set) are excluded or not.

make_dataset_splits accepts the parameter background which can take three values:

  1. background = 'outside' (default): the potential observed sources may appear in the background dataset.

set.seed(123)
list_split <- make_dataset_splits(chickwts, k_ref = n_items, k_quest = n_items, col_source = 'feed', source_ref = 'horsebean',
                                  background = 'outside')

# Observed reference source
unique(list_split$df_reference$feed)
#> [1] horsebean
#> Levels: casein horsebean linseed meatmeal soybean sunflower
# Observed questioned sources
unique(list_split$df_questioned$feed)
#> [1] soybean   sunflower meatmeal  casein   
#> Levels: casein horsebean linseed meatmeal soybean sunflower
# Sources in background
unique(list_split$df_background$feed)
#> [1] horsebean linseed   soybean   sunflower meatmeal  casein   
#> Levels: casein horsebean linseed meatmeal soybean sunflower

  2. background = 'others': the potential observed sources cannot appear in the background set.

Notice that if the potential sources span all the available sources, then the background dataset must be empty.
This is the case when the questioned source is not specified, as it is automatically considered to be “all but the reference source”.

set.seed(123)
list_split <- make_dataset_splits(chickwts, k_ref = n_items, k_quest = n_items, col_source = 'feed', 
                                  source_ref = 'horsebean', source_quest = NULL,
                                  background = 'others')
#> Warning in make_idx_splits(sources, k_ref = k_ref, k_quest = k_quest, ...): No
#> background data!

# Observed reference source
unique(list_split$df_reference$feed)
#> [1] horsebean
#> Levels: casein horsebean linseed meatmeal soybean sunflower
# Observed questioned sources
unique(list_split$df_questioned$feed)
#> [1] soybean   sunflower meatmeal  casein   
#> Levels: casein horsebean linseed meatmeal soybean sunflower
# Sources in background
unique(list_split$df_background$feed)
#> factor(0)
#> Levels: casein horsebean linseed meatmeal soybean sunflower

Once the potential observed sources are restricted, the background selection matters:

set.seed(123)
list_split <- make_dataset_splits(chickwts, k_ref = n_items, k_quest = n_items, col_source = 'feed', 
                                  source_ref = 'horsebean', source_quest = c('horsebean', 'soybean'),
                                  background = 'others')

# Observed reference source
unique(list_split$df_reference$feed)
#> [1] horsebean
#> Levels: casein horsebean linseed meatmeal soybean sunflower
# Observed questioned sources
unique(list_split$df_questioned$feed)
#> [1] horsebean soybean  
#> Levels: casein horsebean linseed meatmeal soybean sunflower
# Sources in background
unique(list_split$df_background$feed)
#> [1] linseed   sunflower meatmeal  casein   
#> Levels: casein horsebean linseed meatmeal soybean sunflower

  3. background = 'unobserved': the observed sources (i.e. sources whose items have been sampled, either in the reference or in the questioned set) cannot appear in the background set.

Notice that the previous example with source_quest = NULL no longer raises a warning, as long as at least one source has never been observed:

set.seed(123)
list_split <- make_dataset_splits(chickwts, k_ref = n_items, k_quest = n_items, col_source = 'feed', 
                                  source_ref = 'horsebean', source_quest = NULL,
                                  background = 'unobserved')

# Observed reference source
unique(list_split$df_reference$feed)
#> [1] horsebean
#> Levels: casein horsebean linseed meatmeal soybean sunflower
# Observed questioned sources
unique(list_split$df_questioned$feed)
#> [1] soybean   sunflower meatmeal  casein   
#> Levels: casein horsebean linseed meatmeal soybean sunflower
# Sources in background
unique(list_split$df_background$feed)
#> [1] linseed
#> Levels: casein horsebean linseed meatmeal soybean sunflower
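
The three background modes can be summarised with a small base R sketch. The helper background_sketch() is hypothetical and written only for this illustration; it is not the package's implementation. The indexes are copied from the first example of this vignette, where the questioned source was not specified, so all sources are potential sources.

```r
data(chickwts)

# Hypothetical helper (not part of rsamplestudy): returns background row indexes
background_sketch <- function(sources, idx_ref, idx_quest, potential, mode) {
  # Rows that have not been picked for the reference or questioned sets
  idx_free <- setdiff(seq_along(sources), c(idx_ref, idx_quest))
  # Sources that were actually sampled
  observed <- unique(sources[c(idx_ref, idx_quest)])
  switch(mode,
    outside    = idx_free,
    others     = idx_free[!(sources[idx_free] %in% potential)],
    unobserved = idx_free[!(sources[idx_free] %in% observed)],
    stop("unknown mode")
  )
}

feed <- as.character(chickwts$feed)
idx_ref <- c(2, 3, 6, 8, 10)         # copied from the first example
idx_quest <- c(24, 47, 53, 62, 64)   # copied from the first example
potential <- unique(feed)            # all sources are potential sources

bg_outside <- background_sketch(feed, idx_ref, idx_quest, potential, "outside")
bg_others <- background_sketch(feed, idx_ref, idx_quest, potential, "others")
bg_unobserved <- background_sketch(feed, idx_ref, idx_quest, potential, "unobserved")

length(bg_outside)            # all unpicked rows
length(bg_others)             # empty: every source is a potential source
unique(feed[bg_unobserved])   # only the source that was never sampled
```

With these indexes, the sketch agrees with the outputs shown above: 'outside' keeps every unpicked row, 'others' leaves no background data, and 'unobserved' keeps only linseed.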