Copyright 2019 Faustine Bousquet ([email protected] or [email protected]) from TabMo and IMAG (Institut Montpelliérain Alexander Grothendieck, University of Montpellier). The binomialMix package is available under the Apache2 license.

The **binomialMix** package provides a clustering method for longitudinal and non gaussian data. It uses an EM algorithm for GLM. For now, a model-based clustering for mixture of binomial data is available.

You can install the `binomialMix`

R package with the following R command:

```
# install.packages("devtools")
devtools::install_git("https://gitlab.com/tabmo/binomialmix")
devtools::install_gitlab("tabmo/binomialMix")
```

You can also directly use the git repository :

`git clone https://gitlab.com/tabmo/binomialMix`

Once you cloned the git repository, you can run to install the `binomialMix`

package:

`devtools::install("/path/to/binomialMix/pkg") # edit the path`

Imagine that you are working for an advertising company. You need to make groups of campaigns with similar profiles.

```
# our library for mixture modelling:
library(binomialMix)
# if not installed :
#install.packages("pander", repos="http://cran.us.r-project.org")
#install.packages("ggplot2", repos="http://cran.us.r-project.org")
#library(pander)
library(qpdf)
```

`data(adcampaign)`

```
## id timestamp_ymd yearDay day timeSlot app_or_site impressions click
## 1 14 2019-01-01 1 3 1 app 2675 117
## 2 14 2019-01-01 1 3 1 app 729 16
## 3 14 2019-01-01 1 3 2 app 1016 33
## 4 14 2019-01-01 1 3 2 app 342 6
## 5 14 2019-01-01 1 3 3 app 3431 92
## 6 14 2019-01-01 1 3 3 app 864 9
## ctr
## 1 0.04373832
## 2 0.02194787
## 3 0.03248031
## 4 0.01754386
## 5 0.02681434
## 6 0.01041667
```

*NB : Of course, you can use your own data. The format you need to have is the following:*

a

*dataframe*type is needed (ex:**adcampaign**from binomialMix)a column with

*factor*id representing the objects you want to cluster (ex:**id**from adcampaign )a target value (ex:

**ctr**from adcampaign)a weighted value variable as we are in case of binomial data (ex:

**impressions**from adcampaign)at least, one column as

*explicative variable*(ex:**day**from adcampaign)

The objective of the study is to group advertising campaigns into clusters. We observe by campaign, time slot, day of week and ad slot campaign (like app or site) the observed number of clicks and impressions. CTR corresponds to the number of click on the number of impressions. CTR value differs a lot from one observation to another, as well as the total length of a campaign. Some last fews days and others broadcast for months. Then, each campaigns (**column “id”**) is composed of n_c observations from the whole dataset and we have repeated mesure for a same id level. The available explicative variables are:

day

timeSlot

app_or_site

Let’s now try to cluster our dataset into K groups.

```
# The dataframe to cluster:
df_tocluster<-adcampaign
# We choose two explainable variables:
model_formula<-"ctr~timeSlot+day"
# As we are in a case of binomial mixture model, we define the weighted variable
weighted_variable<-"impressions"
# We want to analyse results for K=3.
K<-3
# We define the individual to cluster:
col_id<-"id"
set.seed(1992)
# We run our EM algorithm developped for mixture of binomial and longitudinal dataset:
result_K3<-runEM(model_formula,
weighted_variable,
K,
df_tocluster,
col_id)
```

The output of the runEM function provides the following values:

Loglikelihood for each EM iteration

Estimation of model parameters (

*β*,*λ*,*π*)BIC and ICL values

Number of fisher iteration needed for each M-Step

**Plotting evolution of Loglikelihood over iteration**

```
library(ggplot2)
qplot(seq_along(result_K3[[1]]), result_K3[[1]],
xlab="Number of EM iterations",
ylab="Loglikelihood")
```

**Estimated β parameters**

Let’s have a look at the estimated parameters for each cluster k. We only show the estimation from the last EM iteration in the following.

`result_K3[[3]][[length(result_K3[[3]])]]`

```
## k=1 k=2 k=3
## [1,] -3.27524617 -6.02001952 -5.0421272
## [2,] 0.31581767 0.12798712 0.4729844
## [3,] 0.20178096 0.26338824 0.5358001
## [4,] 0.26372976 0.43174257 0.7074703
## [5,] 0.09812297 0.41739277 0.8413444
## [6,] 0.08388539 0.09588596 0.7098054
```

**Estimated proportion of campaigns λ for each cluster**

We want to have a look at the repartition of our campaigns for adcampaign dataset to analyze the size of each cluster. We only display value for the last iteration of EM algorithm.

`result_K3[[3]][[length(result_K3[[3]])]]`

`## [1] 0.114300 0.498075 0.387625`

**Matrix of proability for each campaign to belong to the different clusters**

We analyze the contribution of each campaign to the K clusters. The columns define the campaigns and the rows the different cluster k.

```
# We only display the results for the first 10 campaigns (10 columns)
set.seed(1992)
result_K3[[4]][[length(result_K3[[4]])]][,1:10]
```

```
## ID_1 ID_2 ID_3 ID_4 ID_5 ID_6 ID_7 ID_8 ID_9 ID_10
## k=1 0 0 0 0 0 0 0 0.000 0.000 0
## k=2 0 0 1 0 0 0 1 0.999 0.096 1
## k=3 1 1 0 1 1 1 0 0.001 0.904 0
```

**Analyze of BIC and ICL values**

The analyze of BIC and ICL values is essential when we want to choose the right number of clusters. We can compare BIC/ICL values and choose the K that minimize one or both of these criteria.

```
result_K3[[5]][[length(result_K3[[5]])]] # BIC value
result_K3[[6]][[length(result_K3[[6]])]] # ICL value
```

`## [1] "BIC=372360.14"`

`## [1] "ICL=372367.72"`

**Analyze of Fisher scoring number of iterations for each M step**

If we want to know the number of Fisher scoring iterations at each M step, we can display the following matrix.

`matrix(unlist(result_K3[[7]]),ncol=length(result_K3[[7]])-1)`

```
## iter_1 iter_2 iter_3 iter_4 iter_5 iter_6 iter_7 iter_8
## k=1 4 3 3 2 1 1 1 1
## k=2 4 3 1 3 2 1 1 1
## k=3 4 3 3 2 2 1 1 1
```