vignettes/sampling_study.Rmd
This vignette introduces the concepts behind the comparison of two sets of items, as implemented by the functions make_dataset_splits and make_idx_splits.
Suppose that we have a set of observations, each with an associated source.
The \(j\)-th observation from the \(i\)-th source is denoted by \(X_{ij}\). In total, we have \(m\) sources, with \(n_i\) observations from the \(i\)-th source.
The general model for the data is:
\[X_{ij} \sim f(\theta_i) \quad \text{iid, } j = 1, \ldots, n_i \]
where \(\theta_i\) are the parameters that uniquely characterise the \(i\)-th source.
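To make the notation concrete, here is a minimal base-R sketch that simulates data from a model of this form. The choice of a Normal distribution for \(f\), and the values of `m`, `n_i` and `theta`, are assumptions made only for illustration; they are not part of the package.

```r
# Hypothetical illustration: f is assumed to be Normal(theta_i, sd = 1)
set.seed(1)
m <- 3                   # number of sources
n_i <- c(4, 6, 5)        # observations per source
theta <- c(10, 20, 30)   # one (assumed) parameter per source

# One row per observation X_ij, labelled with its source i
sim <- data.frame(
  source = factor(rep(seq_len(m), times = n_i)),
  x = rnorm(sum(n_i), mean = rep(theta, times = n_i), sd = 1)
)

# Number of observations per source: should reproduce n_i
table(sim$source)
```

Each row of `sim` plays the role of one \(X_{ij}\), and the `source` column plays the role of the source label used in the rest of this vignette.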
For the purpose of illustration, let’s use the popular dataset chickwts:
data(chickwts)
head(chickwts)
#> weight feed
#> 1 179 horsebean
#> 2 160 horsebean
#> 3 136 horsebean
#> 4 227 horsebean
#> 5 217 horsebean
#> 6 168 horsebean
The data are the recorded weights of 71 chicks, each fed one of 6 different feed supplements (feed column). We consider feed as the source label: we assume that chicks are exchangeable given the administered feed.
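As a rough sketch of how the notation maps onto chickwts, the number of sources \(m\) and a crude per-source summary can be computed in base R. Using the per-feed mean weight as a stand-in for \(\theta_i\) is an assumption made here for illustration only.

```r
data(chickwts)

# m: the number of distinct sources (feed supplements)
m <- nlevels(chickwts$feed)

# Mean weight per feed: a crude, illustrative summary of each source
mean_by_feed <- tapply(chickwts$weight, chickwts$feed, mean)
round(mean_by_feed, 1)
```

If the feeds differ markedly in their mean weights, the source label carries information about the observations, which is what makes the comparison of sets meaningful.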
This package is used in a forensic evaluative setting, where one has a set of observations from a known source (the reference set and the reference source), and a set of observations whose source(s) is (are) unknown (the questioned set).
One commonly states two competing hypotheses: either the questioned set comes from the reference source, or it comes from one (or more) different sources.
These hypotheses are usually named \(H_1\) and \(H_2\), respectively. Notice that \(H_2\) can consider a single alternative source or multiple sources: all of them are unknown and different from the reference one.
This package assists with the generation of the reference and questioned set starting from the observed data.frame.
The sizes of the reference and questioned sets can be specified, and are usually much smaller than the full data.
It follows that (a subset of) the rows which have not been picked constitute a third set, the background set.
This set is used in a Bayesian setting to learn the (hyper)priors for the statistical model.
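The idea of the three sets can be sketched by hand in base R. This is not the package's implementation, and it is not expected to reproduce the package's random draws; the variable names simply mirror the arguments and outputs used in this vignette.

```r
data(chickwts)
set.seed(123)

source_ref <- 'horsebean'   # reference source
k_ref <- 5                  # size of the reference set
k_quest <- 5                # size of the questioned set

idx_all <- seq_len(nrow(chickwts))
is_ref_source <- chickwts$feed == source_ref

# Reference set: k_ref rows drawn from the reference source
idx_reference <- sort(sample(idx_all[is_ref_source], k_ref))
# Questioned set: k_quest rows drawn from any other source
idx_questioned <- sort(sample(idx_all[!is_ref_source], k_quest))
# Background set: every row that has not been picked
idx_background <- setdiff(idx_all, c(idx_reference, idx_questioned))

length(idx_background)
```

With 71 rows and 5 + 5 picked items, the background set built this way holds the remaining 61 rows.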
To connect with the example, we consider horsebean feed as the reference source.
From the full data, we pick 5 observations to constitute each one of the reference and questioned sets. Let’s see how rsamplestudy does it:
set.seed(123)
library(rsamplestudy)
n_items <- 5          # size of the reference set and of the questioned set
col_source <- 'feed'  # column holding the source label
list_split <- make_dataset_splits(chickwts, k_ref = n_items, k_quest = n_items,
  col_source = col_source, source_ref = 'horsebean')
list_split
#> $idx_reference
#> [1] 2 3 6 8 10
#>
#> $idx_questioned
#> [1] 24 47 53 62 64
#>
#> $idx_background
#> [1] 1 4 5 7 9 11 12 13 14 15 16 17 18 19 20 21 22 23 25 26 27 28 29 30 31
#> [26] 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 48 49 50 51 52 54 55 56 57 58
#> [51] 59 60 61 63 65 66 67 68 69 70 71
#>
#> $df_reference
#> weight feed
#> 2 160 horsebean
#> 3 136 horsebean
#> 6 168 horsebean
#> 8 124 horsebean
#> 10 140 horsebean
#>
#> $df_questioned
#> weight feed
#> 24 230 soybean
#> 47 297 sunflower
#> 53 380 meatmeal
#> 62 379 casein
#> 64 404 casein
#>
#> $df_background
#> weight feed
#> 1 179 horsebean
#> 4 227 horsebean
#> 5 217 horsebean
#> 7 108 horsebean
#> 9 143 horsebean
#> 11 309 linseed
#> 12 229 linseed
#> 13 181 linseed
#> 14 141 linseed
#> 15 260 linseed
#> 16 203 linseed
#> 17 148 linseed
#> 18 169 linseed
#> 19 213 linseed
#> 20 257 linseed
#> 21 244 linseed
#> 22 271 linseed
#> 23 243 soybean
#> 25 248 soybean
#> 26 327 soybean
#> 27 329 soybean
#> 28 250 soybean
#> 29 193 soybean
#> 30 271 soybean
#> 31 316 soybean
#> 32 267 soybean
#> 33 199 soybean
#> 34 171 soybean
#> 35 158 soybean
#> 36 248 soybean
#> 37 423 sunflower
#> 38 340 sunflower
#> 39 392 sunflower
#> 40 339 sunflower
#> 41 341 sunflower
#> 42 226 sunflower
#> 43 320 sunflower
#> 44 295 sunflower
#> 45 334 sunflower
#> 46 322 sunflower
#> 48 318 sunflower
#> 49 325 meatmeal
#> 50 257 meatmeal
#> 51 303 meatmeal
#> 52 315 meatmeal
#> 54 153 meatmeal
#> 55 263 meatmeal
#> 56 242 meatmeal
#> 57 206 meatmeal
#> 58 344 meatmeal
#> 59 258 meatmeal
#> 60 368 casein
#> 61 390 casein
#> 63 260 casein
#> 65 318 casein
#> 66 352 casein
#> 67 359 casein
#> 68 216 casein
#> 69 222 casein
#> 70 283 casein
#> 71 332 casein
The return value is a list with six elements: three vectors with the indexes of the rows in chickwts, and three data frames (i.e. chickwts split up according to those indexes).
Notice that the questioned sources were picked automatically.
This behaviour is specified in the make_dataset_splits documentation: if the questioned source is not specified, then any source different from the reference one is picked as a potential source. Then, items are randomly picked.
We can also force the questioned source(s): e.g. we pick items from 'horsebean' and 'soybean':
list_split <- make_dataset_splits(chickwts, k_ref = n_items, k_quest = n_items, col_source = 'feed',
source_ref = 'horsebean', source_quest = c('horsebean', 'soybean'))
list_split
#> $idx_reference
#> [1] 1 3 5 8 9
#>
#> $idx_questioned
#> [1] 6 24 25 26 27
#>
#> $idx_background
#> [1] 2 4 7 10 11 12 13 14 15 16 17 18 19 20 21 22 23 28 29 30 31 32 33 34 35
#> [26] 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
#> [51] 61 62 63 64 65 66 67 68 69 70 71
#>
#> $df_reference
#> weight feed
#> 1 179 horsebean
#> 3 136 horsebean
#> 5 217 horsebean
#> 8 124 horsebean
#> 9 143 horsebean
#>
#> $df_questioned
#> weight feed
#> 6 168 horsebean
#> 24 230 soybean
#> 25 248 soybean
#> 26 327 soybean
#> 27 329 soybean
#>
#> $df_background
#> weight feed
#> 2 160 horsebean
#> 4 227 horsebean
#> 7 108 horsebean
#> 10 140 horsebean
#> 11 309 linseed
#> 12 229 linseed
#> 13 181 linseed
#> 14 141 linseed
#> 15 260 linseed
#> 16 203 linseed
#> 17 148 linseed
#> 18 169 linseed
#> 19 213 linseed
#> 20 257 linseed
#> 21 244 linseed
#> 22 271 linseed
#> 23 243 soybean
#> 28 250 soybean
#> 29 193 soybean
#> 30 271 soybean
#> 31 316 soybean
#> 32 267 soybean
#> 33 199 soybean
#> 34 171 soybean
#> 35 158 soybean
#> 36 248 soybean
#> 37 423 sunflower
#> 38 340 sunflower
#> 39 392 sunflower
#> 40 339 sunflower
#> 41 341 sunflower
#> 42 226 sunflower
#> 43 320 sunflower
#> 44 295 sunflower
#> 45 334 sunflower
#> 46 322 sunflower
#> 47 297 sunflower
#> 48 318 sunflower
#> 49 325 meatmeal
#> 50 257 meatmeal
#> 51 303 meatmeal
#> 52 315 meatmeal
#> 53 380 meatmeal
#> 54 153 meatmeal
#> 55 263 meatmeal
#> 56 242 meatmeal
#> 57 206 meatmeal
#> 58 344 meatmeal
#> 59 258 meatmeal
#> 60 368 casein
#> 61 390 casein
#> 62 379 casein
#> 63 260 casein
#> 64 404 casein
#> 65 318 casein
#> 66 352 casein
#> 67 359 casein
#> 68 216 casein
#> 69 222 casein
#> 70 283 casein
#> 71 332 casein
It is always guaranteed that no item appears in more than one set.
It may be explicitly allowed to pick an item multiple times within a set (but it will never appear in other sets).
For example, consider the chick counts per feed:
table(chickwts$feed)
#>
#> casein horsebean linseed meatmeal soybean sunflower
#> 12 10 12 11 14 12
We might want to obtain bootstrap samples from the chicks that have been fed horsebean, by sampling 100 times. Since only 10 such chicks are available, items are sampled with replacement.
n_items_rep <- 100
list_split <- make_dataset_splits(chickwts,
k_ref = n_items_rep,
k_quest = n_items_rep, col_source = 'feed', source_ref = 'horsebean')
#> Reference items: sampling with replacement is being used.
#> Questioned items: sampling with replacement is being used.
head(list_split$df_reference)
#> weight feed
#> 1 179 horsebean
#> 1.1 179 horsebean
#> 1.2 179 horsebean
#> 1.3 179 horsebean
#> 1.4 179 horsebean
#> 1.5 179 horsebean
head(list_split$df_questioned)
#> weight feed
#> 11 309 linseed
#> 12 229 linseed
#> 13 181 linseed
#> 14 141 linseed
#> 16 203 linseed
#> 17 148 linseed
head(list_split$df_background)
#> weight feed
#> 15 260 linseed
#> 22 271 linseed
#> 25 248 soybean
#> 27 329 soybean
#> 29 193 soybean
#> 37 423 sunflower
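The row names 1.1, 1.2, ... in df_reference come from base R: when a data frame is indexed with repeated row indexes, R makes the row names unique by appending a suffix. The following base-R sketch reproduces the effect without the package; the variable names are chosen here for illustration.

```r
data(chickwts)
set.seed(123)

# Only 10 chicks were fed horsebean
idx_horsebean <- which(chickwts$feed == 'horsebean')

# Drawing 100 items from 10 rows requires sampling with replacement
idx_boot <- sort(sample(idx_horsebean, 100, replace = TRUE))
df_boot <- chickwts[idx_boot, ]

# Repeated rows receive suffixed, unique row names
head(rownames(df_boot))
```

The suffix therefore only marks a repeated draw of the same original row; the underlying observation is unchanged.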
Even when sampling with replacement, no item appears in more than one set.
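This property can be checked on any returned list with a small helper. The function `sets_are_disjoint` below is hypothetical (it is not part of rsamplestudy), and the index vectors are copied from the first split shown in this vignette so that the sketch runs without the package.

```r
# Hypothetical helper: TRUE if no index is shared between any two sets
sets_are_disjoint <- function(split) {
  idx <- list(split$idx_reference, split$idx_questioned, split$idx_background)
  pairs <- combn(length(idx), 2)   # all pairs of sets
  all(apply(pairs, 2, function(p) {
    length(intersect(idx[[p[1]]], idx[[p[2]]])) == 0
  }))
}

# Index vectors copied from the first split in this vignette
picked <- c(2, 3, 6, 8, 10, 24, 47, 53, 62, 64)
example_split <- list(
  idx_reference  = c(2, 3, 6, 8, 10),
  idx_questioned = c(24, 47, 53, 62, 64),
  idx_background = setdiff(1:71, picked)
)

sets_are_disjoint(example_split)
```

Repeated indexes within one set do not affect the check, since only overlaps between different sets are tested.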
The background set can be automatically constituted in three ways, depending on whether the potential sampled sources (i.e. sources whose items could appear either in the reference or in the questioned set) are excluded or not.
make_dataset_splits accepts the parameter background, which can take three values:
background = 'outside' (default): the potential observed sources may appear in the background dataset.
set.seed(123)
list_split <- make_dataset_splits(chickwts, k_ref = n_items, k_quest = n_items, col_source = 'feed', source_ref = 'horsebean',
background = 'outside')
# Observed reference source
unique(list_split$df_reference$feed)
#> [1] horsebean
#> Levels: casein horsebean linseed meatmeal soybean sunflower
# Observed questioned sources
unique(list_split$df_questioned$feed)
#> [1] soybean sunflower meatmeal casein
#> Levels: casein horsebean linseed meatmeal soybean sunflower
# Sources in background
unique(list_split$df_background$feed)
#> [1] horsebean linseed soybean sunflower meatmeal casein
#> Levels: casein horsebean linseed meatmeal soybean sunflower
background = 'others': the potential observed sources cannot appear in the background set. Notice that if the potential sources span all the available sources, then the background dataset must be empty.
This is the case when the questioned source is not specified, as it is automatically considered to be “all but the reference source”.
set.seed(123)
list_split <- make_dataset_splits(chickwts, k_ref = n_items, k_quest = n_items, col_source = 'feed',
source_ref = 'horsebean', source_quest = NULL,
background = 'others')
#> Warning in make_idx_splits(sources, k_ref = k_ref, k_quest = k_quest, ...): No
#> background data!
# Observed reference source
unique(list_split$df_reference$feed)
#> [1] horsebean
#> Levels: casein horsebean linseed meatmeal soybean sunflower
# Observed questioned sources
unique(list_split$df_questioned$feed)
#> [1] soybean sunflower meatmeal casein
#> Levels: casein horsebean linseed meatmeal soybean sunflower
# Sources in background
unique(list_split$df_background$feed)
#> factor(0)
#> Levels: casein horsebean linseed meatmeal soybean sunflower
Once the potential observed sources are restricted, the background selection matters:
set.seed(123)
list_split <- make_dataset_splits(chickwts, k_ref = n_items, k_quest = n_items, col_source = 'feed',
source_ref = 'horsebean', source_quest = c('horsebean', 'soybean'),
background = 'others')
# Observed reference source
unique(list_split$df_reference$feed)
#> [1] horsebean
#> Levels: casein horsebean linseed meatmeal soybean sunflower
# Observed questioned sources
unique(list_split$df_questioned$feed)
#> [1] horsebean soybean
#> Levels: casein horsebean linseed meatmeal soybean sunflower
# Sources in background
unique(list_split$df_background$feed)
#> [1] linseed sunflower meatmeal casein
#> Levels: casein horsebean linseed meatmeal soybean sunflower
background = 'unobserved': the observed sources (i.e. sources whose items have been sampled, either in the reference or in the questioned set) cannot appear in the background set. Notice that the earlier example with source_quest = NULL no longer produces an empty background set, as long as at least one source has never been observed:
set.seed(123)
list_split <- make_dataset_splits(chickwts, k_ref = n_items, k_quest = n_items, col_source = 'feed',
source_ref = 'horsebean', source_quest = NULL,
background = 'unobserved')
# Observed reference source
unique(list_split$df_reference$feed)
#> [1] horsebean
#> Levels: casein horsebean linseed meatmeal soybean sunflower
# Observed questioned sources
unique(list_split$df_questioned$feed)
#> [1] soybean sunflower meatmeal casein
#> Levels: casein horsebean linseed meatmeal soybean sunflower
# Sources in background
unique(list_split$df_background$feed)
#> [1] linseed
#> Levels: casein horsebean linseed meatmeal soybean sunflower
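The three options can be summarised with a base-R sketch of the selection logic as described in this vignette. The helper `background_idx` and its arguments are hypothetical and are not the package's implementation; the index vectors are copied from the first split shown above.

```r
# Hypothetical helper mirroring the three background options
background_idx <- function(sources, idx_ref, idx_quest, potential_sources,
                           background = c('outside', 'others', 'unobserved')) {
  background <- match.arg(background)
  # Rows picked for neither the reference nor the questioned set
  idx_free <- setdiff(seq_along(sources), c(idx_ref, idx_quest))
  # Sources that must not appear in the background set
  excluded <- switch(background,
    outside    = character(0),
    others     = potential_sources,
    unobserved = unique(as.character(sources[c(idx_ref, idx_quest)]))
  )
  idx_free[!(as.character(sources[idx_free]) %in% excluded)]
}

data(chickwts)
idx_ref <- c(2, 3, 6, 8, 10)
idx_quest <- c(24, 47, 53, 62, 64)
# Questioned source not specified: every source is a potential source
potential <- levels(chickwts$feed)

idx_bg <- background_idx(chickwts$feed, idx_ref, idx_quest, potential, 'unobserved')
unique(as.character(chickwts$feed[idx_bg]))
```

With these indexes, 'outside' keeps all 61 unpicked rows, 'others' leaves nothing, and 'unobserved' keeps only the linseed rows, in line with the outputs above.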