Dirichlet-Dirichlet study generation

A study consists of the parameters of a model, the data generated from those parameters (the population), and three sets of samples extracted from the population: the reference, questioned and background items.

This package implements the generation of selected studies.
This vignette describes the Dirichlet-Dirichlet model.

The model

Consider Dirichlet samples \(\boldsymbol{X}_{ij}\) from \(m\) different sources, where each source is sampled \(n\) times:

\(\boldsymbol{X}_{ij} \mid \boldsymbol{\theta}_i \sim \text{Dirichlet}(\boldsymbol{\theta}_i) \quad\) iid \(\forall j = 1, \ldots, n \,\) with \(\,i = 1, \ldots, m\)
\(\boldsymbol{\theta}_i \mid \boldsymbol{\alpha} \sim \text{Dirichlet}(\boldsymbol{\alpha}) \quad\) iid \(\forall i = 1, \ldots, m\)

We assume that \(\boldsymbol{\alpha}\) is known.
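
The two levels of the model can be sketched in base R. This is only an illustration of the hierarchy, not the package's implementation: the helper rdirichlet_one is hypothetical, it draws a Dirichlet vector by normalising independent Gamma draws, and the value of alpha is an arbitrary choice made for the example.

```r
# Minimal base-R sketch of the two-level model (not the package's code).
# rdirichlet_one is a hypothetical helper: it draws one Dirichlet(alpha)
# vector by normalising independent Gamma(alpha_k, 1) draws.
rdirichlet_one <- function(alpha) {
  g <- rgamma(length(alpha), shape = alpha, rate = 1)
  g / sum(g)
}

set.seed(1)
m <- 3               # number of sources
n <- 5               # items per source
p <- 4               # components per item
alpha <- rep(2, p)   # hyperparameter, assumed known (illustrative value)

# Source level: theta_i | alpha ~ Dirichlet(alpha), one row per source
theta <- t(replicate(m, rdirichlet_one(alpha)))

# Item level: X_ij | theta_i ~ Dirichlet(theta_i), n rows per source
x <- do.call(rbind, lapply(seq_len(m), function(i) {
  t(replicate(n, rdirichlet_one(theta[i, ])))
}))
source_id <- rep(seq_len(m), each = n)
```

Each row of theta and of x is a point on the simplex, so its components are non-negative and sum to one.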

Population generation

The population can be generated using fun_rdirichlet_population:

# Population parameters:
# Number of items per source
n <- 10
# Number of sources
m <- 20
# Number of variables per item
p <- 4

list_pop <- fun_rdirichlet_population(n, m, p)

The output contains:

the population df_pop
the model parameters df_sources (one \(\boldsymbol{\theta}_i\) per source) and alpha (the hyperparameter)
names of the components of the source vectors names_source
names of the components of the item vectors names_var

Note that the hyperparameter \(\boldsymbol{\alpha}\) is sampled as well, although it can be fixed instead.

head(list_pop$df_pop)
#> # A tibble: 6 x 5
#>   source   `x[1]` `x[2]` `x[3]` `x[4]`
#>    <int>    <dbl>  <dbl>  <dbl>  <dbl>
#> 1      1 5.64e-38      0  1          0
#> 2      1 1.66e-35      0  1          0
#> 3      1 2.88e-31      0  1          0
#> 4      1 1.19e-46      0  1          0
#> 5      1 1.51e- 3      0  0.998      0
#> 6      1 2.14e-25      0  1          0
head(list_pop$df_sources)
#> # A tibble: 6 x 5
#>   source `theta[1]` `theta[2]` `theta[3]` `theta[4]`
#>    <int>      <dbl>      <dbl>      <dbl>      <dbl>
#> 1      1    0.0115    0.000113     0.988   1.30e-124
#> 2      2    0.770     0.186        0.0431  3.83e-290
#> 3      3    0.00198   0.436        0.562   2.41e-122
#> 4      4    0.554     0.0287       0.417   1.91e-229
#> 5      5    0.162     0.615        0.222   1.41e- 41
#> 6      6    0.837     0.00159      0.162   0.

Hyperparameters

We assume that the Dirichlet hyperparameter (the level farthest from the data) comes from the uniform distribution on the \((p-1)\)-simplex.
In other words, we sample the Dirichlet hyperparameter from the \(\text{Dirichlet}(\boldsymbol{1})\) distribution.
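
The equivalence of the two statements follows from the standard form of the Dirichlet density. Setting every parameter to one gives, for a point \(\boldsymbol{\alpha}\) on the \((p-1)\)-simplex:

\[
f(\boldsymbol{\alpha} \mid \boldsymbol{1}) = \frac{\Gamma(p)}{\prod_{k=1}^{p} \Gamma(1)} \prod_{k=1}^{p} \alpha_k^{1-1} = (p-1)!
\]

The density does not depend on \(\boldsymbol{\alpha}\), so it is constant over the simplex, which is the uniform distribution.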

The shortcut function in the package is fun_rdirichlet_hyperparameter:

df_diri <- purrr::map_dfr(1:300, ~ fun_rdirichlet_hyperparameter(3))
scatter_matrix_simplex(df_diri)

Partitioning

Once the population is generated, the reference/questioned/background samples must be extracted.
In general, this is done with make_dataset_splits:


k_ref <- 10
k_quest <- 5

list_samples <- make_dataset_splits(list_pop$df_pop, k_ref, k_quest)
names(list_samples)
#> [1] "idx_reference"  "idx_questioned" "idx_background" "df_reference"  
#> [5] "df_questioned"  "df_background"

head(list_samples$df_reference)
#> # A tibble: 6 x 5
#>   source `x[1]`     `x[2]` `x[3]` `x[4]`
#>    <int>  <dbl>      <dbl>  <dbl>  <dbl>
#> 1     18      0 0.905      0.0951      0
#> 2     18      0 0.381      0.619       0
#> 3     18      0 0.00000422 1.00        0
#> 4     18      0 0.0402     0.960       0
#> 5     18      0 0.491      0.509       0
#> 6     18      0 0.000656   0.999       0
head(list_samples$df_questioned)
#> # A tibble: 5 x 5
#>   source `x[1]`        `x[2]`    `x[3]` `x[4]`
#>    <int>  <dbl>         <dbl>     <dbl>  <dbl>
#> 1      4  0.476 0.00000000186 0.524          0
#> 2      5  0.507 0.493         0.0000482      0
#> 3      7  0     0             1              0
#> 4     11  0     0.0266        0.973          0
#> 5     16  0     0.0155        0.984          0
head(list_samples$df_background)
#> # A tibble: 6 x 5
#>   source   `x[1]` `x[2]` `x[3]` `x[4]`
#>    <int>    <dbl>  <dbl>  <dbl>  <dbl>
#> 1      1 5.64e-38      0  1          0
#> 2      1 1.66e-35      0  1          0
#> 3      1 2.88e-31      0  1          0
#> 4      1 1.19e-46      0  1          0
#> 5      1 1.51e- 3      0  0.998      0
#> 6      1 2.14e-25      0  1          0
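
The idea behind such a split can be sketched in base R on a toy population. This is a hypothetical illustration only: it assumes that the reference items come from a single source and that the questioned items are drawn from whatever remains, and it does not reproduce the internals or the options of make_dataset_splits. The toy data frame and all variable names are made up for the example.

```r
# Hypothetical base-R sketch of a reference/questioned/background split.
# It illustrates the general idea; it is not the code of make_dataset_splits.
set.seed(1)
df <- data.frame(source = rep(1:4, each = 6), x = runif(24))  # toy population

k_ref <- 3     # number of reference items
k_quest <- 2   # number of questioned items

# Reference: k_ref items drawn from one randomly chosen source
source_ref <- sample(unique(df$source), 1)
rows_ref <- which(df$source == source_ref)
idx_reference <- rows_ref[sample.int(length(rows_ref), k_ref)]

# Questioned: k_quest items drawn from the items not used as reference
rows_left <- setdiff(seq_len(nrow(df)), idx_reference)
idx_questioned <- rows_left[sample.int(length(rows_left), k_quest)]

# Background: every item not used above
idx_background <- setdiff(seq_len(nrow(df)), c(idx_reference, idx_questioned))

df_reference <- df[idx_reference, ]
df_questioned <- df[idx_questioned, ]
df_background <- df[idx_background, ]
```

The three index vectors are disjoint and together cover every row of the population, so no item is used twice.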

Source parameters

The sources from which the samples are drawn can be fixed.

See the documentation for make_dataset_splits.

Lorenzo Gaborini

2021-03-05
