vignettes/dirichletdirichlet-study.Rmd
A study consists of the parameters of a model, the data generated from those parameters (the population), and three sets of samples: the reference, questioned and background items.
This package implements the generation of selected studies.
This vignette describes the Dirichlet-Dirichlet model.
Consider Dirichlet samples \(X_i\) from \(m\) different sources, where each source is sampled \(n\) times.
We assume that \(\boldsymbol{\alpha}\) is known.
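Written out, a two-level hierarchy consistent with the names used in the output below (theta for the source parameters, alpha for the hyperparameter) could look as follows. The double index \(X_{ij}\), for item \(j\) of source \(i\), is an assumption introduced here for illustration:

\[
\begin{aligned}
X_{ij} \mid \boldsymbol{\theta}_i &\sim \text{Dirichlet}(\boldsymbol{\theta}_i), & j &= 1, \ldots, n, \\
\boldsymbol{\theta}_i \mid \boldsymbol{\alpha} &\sim \text{Dirichlet}(\boldsymbol{\alpha}), & i &= 1, \ldots, m.
\end{aligned}
\]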
The population can be generated using fun_rdirichlet_population:
# Population parameters:
# Number of items per source
n <- 10
# Number of sources
m <- 20
# Number of variables (dimension of each item)
p <- 4
list_pop <- fun_rdirichlet_population(n, m, p)
The output is a list containing df_pop (the generated items), df_sources (the source parameters), alpha (the hyperparameter), names_source and names_var.
Note that the hyperparameter \(\boldsymbol{\alpha}\) is sampled as well, although it can also be fixed.
head(list_pop$df_pop)
#> # A tibble: 6 x 5
#> source `x[1]` `x[2]` `x[3]` `x[4]`
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 5.64e-38 0 1 0
#> 2 1 1.66e-35 0 1 0
#> 3 1 2.88e-31 0 1 0
#> 4 1 1.19e-46 0 1 0
#> 5 1 1.51e- 3 0 0.998 0
#> 6 1 2.14e-25 0 1 0
head(list_pop$df_sources)
#> # A tibble: 6 x 5
#> source `theta[1]` `theta[2]` `theta[3]` `theta[4]`
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0.0115 0.000113 0.988 1.30e-124
#> 2 2 0.770 0.186 0.0431 3.83e-290
#> 3 3 0.00198 0.436 0.562 2.41e-122
#> 4 4 0.554 0.0287 0.417 1.91e-229
#> 5 5 0.162 0.615 0.222 1.41e- 41
#> 6 6 0.837 0.00159 0.162 0.
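To make the two sampling levels concrete, here is a minimal base-R sketch that produces data of the same shape. It is not the package's implementation: `rdirichlet_one`, `theta_sketch`, `x_sketch` and `df_pop_sketch` are hypothetical names, the hyperparameter is fixed to an arbitrary value, and each Dirichlet draw is obtained by normalising independent Gamma variates.

```r
# Hypothetical helper (not part of the package): one Dirichlet draw,
# obtained by normalising independent Gamma(shape, 1) variates.
rdirichlet_one <- function(shape) {
  g <- rgamma(length(shape), shape = shape, rate = 1)
  g / sum(g)
}

set.seed(1)
n <- 10  # items per source
m <- 20  # sources
p <- 4   # variables

# Assumption: an arbitrary fixed hyperparameter on the simplex
alpha_fixed <- c(0.4, 0.3, 0.2, 0.1)

# Level 1: one parameter vector per source (m x p matrix)
theta_sketch <- t(replicate(m, rdirichlet_one(alpha_fixed)))

# Level 2: n items per source, each drawn from Dirichlet(theta_i)
x_sketch <- do.call(rbind, lapply(seq_len(m), function(i) {
  t(replicate(n, rdirichlet_one(theta_sketch[i, ])))
}))

df_pop_sketch <- data.frame(source = rep(seq_len(m), each = n), x_sketch)
head(df_pop_sketch)
```

Because every row of `theta_sketch` sums to 1, some of its entries are very small, and the items drawn from it concentrate near the vertices of the simplex. This matches the many values close to 0 or 1 in the output above.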
We assume that the Dirichlet hyperparameter (the level farthest from the data) comes from the Uniform distribution on the \((p-1)\)-simplex.
In other words, we will sample the Dirichlet hyperparameter from the \(\text{Dirichlet}{(\boldsymbol{1})}\) distribution.
The shortcut function in the package is fun_rdirichlet_hyperparameter:
df_diri <- purrr::map_dfr(1:300, ~ fun_rdirichlet_hyperparameter(3))
scatter_matrix_simplex(df_diri)
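For intuition, draws from \(\text{Dirichlet}(\boldsymbol{1})\) can be sketched in base R by normalising independent Exp(1) variates, which is the Gamma construction with all shapes equal to 1. This is an illustration only and makes no claim about how the package function works internally; `e` and `alpha_draws` are hypothetical names.

```r
set.seed(42)
p <- 3          # dimension of the hyperparameter
n_draws <- 300  # number of draws

# Independent Exp(1) variates, one row per draw
e <- matrix(rexp(n_draws * p, rate = 1), nrow = n_draws, ncol = p)

# Dividing each row by its sum gives a point on the (p-1)-simplex;
# the vector rowSums(e) is recycled down the columns, so row i is
# divided by its own sum.
alpha_draws <- e / rowSums(e)

# Each row sums to 1, and the component means should be close to 1/p
range(rowSums(alpha_draws))
colMeans(alpha_draws)
```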
Once the population is generated, the reference/questioned/background samples must be extracted.
This is done generically using make_dataset_splits:
# Number of reference items
k_ref <- 10
# Number of questioned items
k_quest <- 5
list_samples <- make_dataset_splits(list_pop$df_pop, k_ref, k_quest)
names(list_samples)
#> [1] "idx_reference" "idx_questioned" "idx_background" "df_reference"
#> [5] "df_questioned" "df_background"
head(list_samples$df_reference)
#> # A tibble: 6 x 5
#> source `x[1]` `x[2]` `x[3]` `x[4]`
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 18 0 0.905 0.0951 0
#> 2 18 0 0.381 0.619 0
#> 3 18 0 0.00000422 1.00 0
#> 4 18 0 0.0402 0.960 0
#> 5 18 0 0.491 0.509 0
#> 6 18 0 0.000656 0.999 0
head(list_samples$df_questioned)
#> # A tibble: 5 x 5
#> source `x[1]` `x[2]` `x[3]` `x[4]`
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 4 0.476 0.00000000186 0.524 0
#> 2 5 0.507 0.493 0.0000482 0
#> 3 7 0 0 1 0
#> 4 11 0 0.0266 0.973 0
#> 5 16 0 0.0155 0.984 0
head(list_samples$df_background)
#> # A tibble: 6 x 5
#> source `x[1]` `x[2]` `x[3]` `x[4]`
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 5.64e-38 0 1 0
#> 2 1 1.66e-35 0 1 0
#> 3 1 2.88e-31 0 1 0
#> 4 1 1.19e-46 0 1 0
#> 5 1 1.51e- 3 0 0.998 0
#> 6 1 2.14e-25 0 1 0
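The following base-R sketch shows one splitting rule that yields three disjoint sets of this kind. It is not the package's make_dataset_splits: `split_sketch` is a hypothetical helper, and the rule it uses is an assumption made for illustration. Reference items come from a single randomly chosen source, questioned items from the remaining rows, and all other rows form the background.

```r
# Hypothetical helper illustrating one possible splitting rule;
# this is not the package's make_dataset_splits().
split_sketch <- function(df, k_ref, k_quest) {
  # Assumption: all reference items come from one randomly chosen source
  sources <- unique(df$source)
  source_ref <- sources[sample.int(length(sources), 1)]
  rows_ref <- which(df$source == source_ref)
  idx_reference <- sort(rows_ref[sample.int(length(rows_ref), k_ref)])

  # Assumption: questioned items are drawn from the rows not used as reference
  idx_rest <- setdiff(seq_len(nrow(df)), idx_reference)
  idx_questioned <- sort(idx_rest[sample.int(length(idx_rest), k_quest)])

  # Everything else is background
  idx_background <- setdiff(idx_rest, idx_questioned)

  list(
    idx_reference = idx_reference,
    idx_questioned = idx_questioned,
    idx_background = idx_background,
    df_reference = df[idx_reference, , drop = FALSE],
    df_questioned = df[idx_questioned, , drop = FALSE],
    df_background = df[idx_background, , drop = FALSE]
  )
}

# Toy population: 20 sources with 10 items each
set.seed(7)
df_toy <- data.frame(source = rep(1:20, each = 10), x = runif(200))
splits <- split_sketch(df_toy, k_ref = 10, k_quest = 5)
sapply(splits[c("idx_reference", "idx_questioned", "idx_background")], length)
```

With 10 items per source and `k_ref = 10`, the reference set uses up every item of the chosen source, so none of the questioned items can share that source.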