The MXM R Package, short for the latin 'Mens ex Machina' ( Mind from the Machine ), is a collection of utility functions for feature selection, cross validation and Bayesian Networks. The package supports conditional independence tests for various combinations of target and predictor variables (continuous, categorical).
MXM offers many feature selection algorithms focused on providing one or more minimal feature subsets, refered also as variable signatures, that can be used to improve the performance of downstream analysis tasks such as regression and classification, by excluding irrelevant and redundant variables. In this tutorial we will learn how to use the MMPC algorithm.
MMPC stands for Max-Min Parents and Children, a constraint based feature selection algorithm, as first described by Brown, Tsamardinos and Aliferis, (2004). Parents and Children refers to the fact that the algorithm identifies the parents and children of the variable of interest ( target ), assuming a Bayesian Network for all observed variables. What it will not recover is the spouses of the children , and for this the FBED algorithmcan be applied The FBED algorithm is also available in the MXM R package and can essentially recover the Markov Blanket of the variable of interest. For simplicity, we will use a dataset with fewer variables, but the algorithms perform especially well in datasets with very large feature space, such as in biomedical datasets (eg. millions of SNPs as variables in GWAS studies, thousands of genes in NGS omics datasets, etc).
At its core, the MMPC algorithm performs multiple conditional independance tests, and progressively excludes irrelevant and/or redundant variables. The final variables that have “survived” through all those elimination stages, are the MMPC output signature.
The selection of the appropriate conditional independence test is a crucial decision for the validity and success of downstream statistical analysis and machine learning tasks. Currently the MXM R package supports nummerous tests for different combinations of target ( dependent ) and predictor ( independent ) variables. A detailed summary table to guide you through the selection of the most suitable test is included in MXM's reference manual ( “CondInditional independence tests” )
MMPC()function: Required and optional arguments
There are 3 mandatory arguments for the
MXM::MMPC() function: 1) an object with the target variable, 2) one with the complete dataset but with the target variable removed and 3) a conditional indepence test, selected by the reference manual table mentioned above. The
dataset has to have the instances (N) as rows and the features (f) as columns. For the
dataset it is recommended to also retain
Several hyper-parameters are also provided as optional arguments in the
MXM::MMPC() function. In the following block, the function along with the most important hyperparameters are presented.
# Overview the MXM::`MMPC()` function mod <- MXM::MMPC( target, # The target variable vector dataset, # The dataset with the target column removed max_k = 3, # The maximum size of the conditioning set to use threshold = 0.05, # level of alpha for statistical significance test = 'testIndFisher', ini = NULL, # if TRUE, the calculated univariate associations # are stored for runtime efficiency in subsequent # MMPC runs with diferent hyper-parameters. hash = TRUE, # if TRUE, the calculated statistics are stored. hashObject = NULL, # the mmpcobject from a previous run ncores = 1, # number of cores for parallel execution. # Recommended for thousands of variables. backward = TRUE) # If TRUE, the backward phase # (or symmetry correction) is implemented. # Falsely included variables, # in the MMPC output signature are removed.
For this tutorial we will use the UCI wine dataset. The dataset contains the results of a chemical analysis performed on 3 different types of wines and includes 12 quality related characteristics plus the information of the wine class as the first attribute (Type: 1,2 or 3). More information about the wine dataset is available at the UCI repository.
The following block installs the MXM package, takes care of package dependencies, downloads and then cleans the dataset for the subsequent steps. The categorical variable (class information) is omitted for this example and we retain only the numerical variables (continuous and count data). We will then apply the MMPC algorithm to acquire a minimal, highly relevant subset of variables that can be used to best model the
PACKAGE DEPENDENCIES and UCI WINE DATASET LOADING:
# 0. INSTALL and LOAD the MXM R Package: #install.packages('MXM', dependencies = TRUE ) library(MXM) # 1. DOWNLOAD the wine dataset from UCI: URL <- "ftp://ftp.ics.uci.edu/pub/machine-learning-databases/wine/wine.data" wine <- read.csv(URL, header = FALSE) # 2. SET variables' names as header: colnames(wine) <- c('Type', 'Alcohol', 'Malic', 'Ash', 'Alcalinity', 'Magnesium', 'Phenols', 'Flavanoids', 'Nonflavanoids', 'Proanthocyanins', 'Color', 'Hue', 'Dilution', 'Proline') # 3. REMOVE the 1st attribute, which is the class information: wine <- wine[,-1] # 4. PREVIEW UCI's wine dataset: head(wine, 2)
## Alcohol Malic Ash Alcalinity Magnesium Phenols Flavanoids Nonflavanoids ## 1 14.23 1.71 2.43 15.6 127 2.80 3.06 0.28 ## 2 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 ## Proanthocyanins Color Hue Dilution Proline ## 1 2.29 5.64 1.04 3.92 1065 ## 2 1.28 4.38 1.05 3.40 1050
# The header should include the wine attributes sans the class labels, # in the following order: # Alcohol | Malic | Ash | Alcalinity | Magnesium | Phenols | Flavanoids # Nonflavanoids | Proanthocyanins | Color | Hue | Dilution | Proline
# 5. CHECK for missing or non-numeric values in the dataframe: sum(is.na(wine))
##  0
sum(is.nan(as.matrix(wine))) #if 0, then No NAs, none NaNs, good to go!
##  0
Even if the dataset contains missing values, they will automatically be replaced by the current variable (column) mean value with an appropriate warning to the user after the execution. For optimal results, it is advised to use a more sophisticated imputation method according to your data's needs before running
# 6. CHECK `wine` object's data type, dimensions: str(wine)
## 'data.frame': 178 obs. of 13 variables: ## $ Alcohol : num 14.2 13.2 13.2 14.4 13.2 ... ## $ Malic : num 1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ... ## $ Ash : num 2.43 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ... ## $ Alcalinity : num 15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ... ## $ Magnesium : int 127 100 101 113 118 112 96 121 97 98 ... ## $ Phenols : num 2.8 2.65 2.8 3.85 2.8 3.27 2.5 2.6 2.8 2.98 ... ## $ Flavanoids : num 3.06 2.76 3.24 3.49 2.69 3.39 2.52 2.51 2.98 3.15 ... ## $ Nonflavanoids : num 0.28 0.26 0.3 0.24 0.39 0.34 0.3 0.31 0.29 0.22 ... ## $ Proanthocyanins: num 2.29 1.28 2.81 2.18 1.82 1.97 1.98 1.25 1.98 1.85 ... ## $ Color : num 5.64 4.38 5.68 7.8 4.32 6.75 5.25 5.05 5.2 7.22 ... ## $ Hue : num 1.04 1.05 1.03 0.86 1.04 1.05 1.02 1.06 1.08 1.01 ... ## $ Dilution : num 3.92 3.4 3.17 3.45 2.93 2.85 3.58 3.58 2.85 3.55 ... ## $ Proline : int 1065 1050 1185 1480 735 1450 1290 1295 1045 1045 ...
# The output should be a datarame: #'data.frame': 178 obs. of 13 variables
MMPC()input objects: 'target', 'dataset'
And now, we will tailor the
target ( non-flavanoids content variable ) and the complete
dataset objects to the the
MMPC()'s function needs. Both will be converted to matrices, using the built in function
as.matrix() and we will make sure that the dataset matrix is given as instances (N) by features (f). After selecting the target variable, create a matrix that includes only the remaining variables. This would be the
dataset input for the
MMPC() function. This is necessary to assure that the signature does not include the target variable.
# 0. Exclude target variable column targetVariable <- wine$Nonflavanoids targetVariable <- NULL # 1. Convert dataframe to matrix: wine_dataset <- as.matrix(wine[, -8]) wine_dataset[, 12] <- as.numeric(wine_dataset[, 12]) head(wine_dataset, 2)
## Alcohol Malic Ash Alcalinity Magnesium Phenols Flavanoids Proanthocyanins ## [1,] 14.23 1.71 2.43 15.6 127 2.80 3.06 2.29 ## [2,] 13.20 1.78 2.14 11.2 100 2.65 2.76 1.28 ## Color Hue Dilution Proline ## [1,] 5.64 1.04 3.92 1065 ## [2,] 4.38 1.05 3.40 1050
We check once more the dimension of our
dataset object. We expect it to be in the form Instances x Features.
# 2. Check dimensions of the wine_dataset # REMINDER: We need it as N x f // N for instances, f or features dim(wine_dataset)
##  178 12
# The output should be 178 x 12, #178 instances and 12 features; if so, we're good to go
If desired, you can allocate the
target variable in a new data structure, merely for the purpose of keeping functions' input easy to read.
# 3. Select the target variable (`Nonflavanoids`) and store as a matrix: target_NonFlav <- as.vector(wine$Nonflavanoids) str(target_NonFlav,2)
## num [1:178] 0.28 0.26 0.3 0.24 0.39 0.34 0.3 0.31 0.29 0.22 ...
After the brief EDA (Exploratory Data Analysis) of the UCI wine dataset, it's time to actually apply
For a comprehensive overview of the hyper-parameter options type
?MMPC in your Rstudio console. Here, we have set
hash = TRUE, for faster runtimes in subsequent runs of the MMPC with different hyper-parameters. We have selected the
testIndFisher test which is the appropriate for our data, that we have retrieved from the Conditional Independece Tests cheatsheet from the MXM reference manual. The
backward = TRUE was also used, for acquiring a minimal signature. All other parameters were left to default or the first run.
# MMPC on the wine dataset: library('MXM') mmpcobject_wine_NonFlav <- MXM::MMPC( target = target_NonFlav, dataset = wine_dataset, max_k = 3, threshold = 0.05, test = 'testIndFisher', ini = NULL, hash = TRUE, hashObject = NULL, ncores = 1, backward = TRUE)
mmpcobjectand the feature signature
Let's now explore the output of the
MMPC function, the mmpcobject. The generic function
summary() can be used to display the indices of the features that are included in the signature, and also contains information about the
MMPC() run, such as the selected hyperparameters, execution runtime and also statistics about the distribution of the p-values from the performed tests. For more specific information, you can access
the mmpcobject fields with the
@ operator, the operator for accessing S4 class objects in R . This is essentially the same as using the dollar operator
$ for accessing R5 class objects ' slots, typically used with dataframes for example. The fields contain the output results of the
MMPC() run and also two lists that can be re-used for subsequent MMPC runs for computation time efficiency.
Below, we can see the two objects that facilitate the computation time efficiency in the
MMPC() re-runs. After running MMPC_ with some hyper-parameters you might want to run the algorithm again with different hyper-parameters (
max_k for example). To avoid calculating the univariate associations (first step of MMPC) again, you can take the list
univ from the first run and provide it as input to the argument
ini in the subsequent runs the algorithm. This can speed up the second run (and subequent runs of course) up to 50%, which is crucial if you are handling datasets with a very high number of features.
# Cache of the stats calculated in the MMPC run str([email protected])
## List of 2 ## $ stat_hash :Class 'Hash' <environment: 0x0000000019339ed8> ## $ pvalue_hash:Class 'Hash' <environment: 0x0000000019336460>
# a list with the univariate associations str([email protected])
## List of 2 ## $ stat : num [1:12] -2.08 3.99 2.49 5.01 -3.47 ... ## $ pvalue: num [1:12] -3.24 -9.25 -4.3 -13.56 -7.32 ...
Let's also save the
runtime of the first MMPC run in a variable, to compare later.
execution_time_1st_MMPC_run <- [email protected] execution_time_1st_MMPC_run
## user system elapsed ## 0.02 0.00 0.03
MMPC()subsequent run: Efficiency with the
Now, we can run
MMPC() again to check how we can use the
univ lists to maximize runtime efficiency when multiple runs are requirted.
# MMPC on the wine dataset: library('MXM') mmpcobject_2nd_run <- MXM::MMPC(target = target_NonFlav, dataset = wine_dataset , # it was set to 3 in the 1st run max_k = 5, # it was set to 0.05 in the 1st run threshold = 0.01, test = 'testIndFisher', #the cached univariate tests ini = [email protected], # cached stats, p-values hashObject = [email protected])
We used the
stored stats, p-values and univariate tests performed in the first run for avoiding redundant calculations.
Our example dataset is very small to highlight the impact of such an implementation in the algorithm, but let's compare the runtimes for first and second run of
execution_time_2nd_MMPC_run <- [email protected] execution_time_1st_MMPC_run
## user system elapsed ## 0.02 0.00 0.03
## user system elapsed ## 0 0 0
Even in our small wine dataset, the difference in runtime is impressive.
MMPC is designed for, and actually shines in high feature space datasets, such as those in the domains of computer vision and -omics approaches in Life Sciences.
In the spirit of automation and performance efficiency, it would be an impossible task to manually search for the optimal set of hyper-parameters. Thus, the
MXM package supports a grid search function, named
mmpc.path. The function returns an object that includes matrices with the following information that can be used for selecting the optimal configuration.
bic: matrix with the BIC values of the final fitted model based on the selected variables identified by each combination of the hyper-parameters.
size: matrix with the legnth of the selected variables identified by each configuration of MMPC.
variables: A list containing the variables from each configuration of MMPC
runtime: The run time of the function
The desired hyperparameters to be checked can be given as vectors in the relative argument.
# Grid Search for MMPC hyper-parameter tuning library('MXM') mmpcGridSearch <- MXM::mmpc.path(target = target_NonFlav, dataset = wine_dataset, max_ks = c(3,4,5,6), # a vector of k to try alphas = NULL, # a vector of thresholds; # If NULL, 0.1, 0.05 and 0.01 # will be tested. test = 'testIndFisher', ncores = 1)
We can acces the
mmpc.path() objects using the dollar
$ operator and then use the generic
which function to retrieve the lowest value for BIC and the repsective hyper-parameters.
BIC_results <- as.data.frame(mmpcGridSearch$bic) head(BIC_results, 4)
## max_k=6 max_k=5 max_k=4 max_k=3 ## alpha=0.1 -288.7666 -288.7666 -288.7666 -288.7666 ## alpha=0.05 -288.7666 -288.7666 -288.7666 -288.7666 ## alpha=0.01 -286.5281 -286.5281 -286.5281 -286.5281
# We can retrieve the indices of the minimum BIC values: which(BIC_results == min(BIC_results), arr.ind = TRUE)
## row col ## alpha=0.1 1 1 ## alpha=0.05 2 1 ## alpha=0.1 1 2 ## alpha=0.05 2 2 ## alpha=0.1 1 3 ## alpha=0.05 2 3 ## alpha=0.1 1 4 ## alpha=0.05 2 4
Above we observe that the highest alpha level, 0.1, is the one with the lowest BIC. However, since the difference is miniscule in the BIC change, we will select a lower alpha level, to avoid including false positives in our model.
We can display also the
size of the selected signatures, which is the number variables the signature of ech configuration contains.
size_of_signature_results <- as.data.frame(mmpcGridSearch$size) head(size_of_signature_results, 4)
## max_k=6 max_k=5 max_k=4 max_k=3 ## alpha=0.1 4 4 4 4 ## alpha=0.05 4 4 4 4 ## alpha=0.01 2 2 2 2
# We can retrieve the indices of the maximum subset: which(size_of_signature_results == max(size_of_signature_results), arr.ind = TRUE)
## row col ## alpha=0.1 1 1 ## alpha=0.05 2 1 ## alpha=0.1 1 2 ## alpha=0.05 2 2 ## alpha=0.1 1 3 ## alpha=0.05 2 3 ## alpha=0.1 1 4 ## alpha=0.05 2 4
We can also preview, the indices of the actual variables that were included in the signature in each configuration:
## $`alpha=0.1 & max_k=6` ##  4 5 7 11 ## ## $`alpha=0.1 & max_k=5` ##  4 5 7 11 ## ## $`alpha=0.1 & max_k=4` ##  4 5 7 11 ## ## $`alpha=0.1 & max_k=3` ##  4 5 7 11
Let's now get back at our initial
MMPC() run, and explore the output summary:
## Length Class Mode ## 1 MMPCoutput S4
We can retrieve the selected MMPC signature, by choosing
selectedVarsOrder from the MMPC output object; this way, the features are printed based on highest statistical significance.
##  7 4 5 11
# The signature should include the variables with indices 7, 4, 5
We can retrieve the names of the features in the signature by selecting the
colnames with the above indices from the
dataset provided as input in the
##  "Flavanoids"
##  "Alcalinity"
##  "Magnesium"
MMPC() signature, highlights the
" Magnesium" content as variables of high importance in relation to our selected target variable
We can actually create a regression model, using the above signature variables and check how the model performs. We will call the
mmpc.model function and use the
mmpcObject from the
MMPC() run above, as input. For a more detailed overview the the function
mmpc.model(), you can type
"?mmpc.model" in your rstudio console.
# MODEL ESTIMATES USING MMPC'S FEATURE SUBSET AS PrEDICTORs mmpcmodel_wine_NonFlav<- mmpc.model( target = target_NonFlav, dataset = wine_dataset, wei = NULL, mmpcObject = mmpcobject_wine_NonFlav, test = 'testIndFisher') summary(mmpcmodel_wine_NonFlav) ;
## Length Class Mode ## mod 12 lm list ## signature 5 -none- numeric
The MMPC algorithm follows a forward-backward filter approach for feature selection in order to provide a minimal, highly-predictive feature subset of a high dimensional dataset. The
max_k hyper-parameter dictates the maximum number of variables as a conditioning set to use in the conditioning independence test. In our example, 12 model where built in the process, amongst them also the one with the final selected features as variables.
ypografi variable denotes the indices of the MMPC selected features and also the BIC (Bayesian Information Criterion) of the final model. If you are not familiar with BIC as a model selection criterion, you can type
?BIC in your Rstudio console for a brief introduction. As a rough estimate, when comparing models of the same Y as target variable, the model with the lowest BIC is prefered. Below, we will retrieve the beta Coefficients and the intercept for the selected model, so that we can write the actual formula for the regression model, for the
"NonFlavanoids" as target variable.
## ## Call: ## lm(formula = target ~ ., data = as.data.frame(dataset[, signature]), ## weights = wei) ## ## Coefficients: ## (Intercept) Flavanoids Alcalinity Magnesium Dilution ## 0.549466 -0.029591 0.007240 -0.001543 -0.043972
"NonFlavanoids" target variable, be Y, then the formula for the model can be written as follows:
Y = -0.087194 + 0.032057 Alcalinity -0.006664 Magnesium -0.236471 Flavanoids
MXM package offers a permutation version of the statistical tests, which is recommended for small sample size datasets.
If you are working with very large datasets, in order to save computational time, there is a trick to avoind doing all permutations. As soon as the number of times the permuted test statistic is more than the observed test statistic is more than 50 (if threshold = 0.05 and R = 999), the p-value has exceeded the signifiance level (threshold value) and hence the predictor variable is not significant. There is no need to continue do the extra permutations, as a decision has already been made.
has the same arguments as theMMPC()
function, with an additional option calledR which is the number of permutations.
# TESTS WITH PERMUTATIONS: library('MXM') permutation_MMPC_model <- MXM::perm.mmpc (target_NonFlav, wine_dataset, R = 999, # Number of permutations max_k = 3, threshold = 0.05, test = 'permFisher', ncores = 1, backward = FALSE)
max_kfor the conditioning set
As a rule of thumb, when the sample size is rather small eg. < 100, the default
max_k = 3 is recommended. In small feature space datasets, as for example in the case of the wine dataset the default option is also recommended. When working with high dimensional datasets (many hundreds to thousands of features), the following approximation can be used:
max_k = floor(N/10)
where N is the number of instances.
For example, in a gene expression dataset, where N = 100 and features = 50000, max_k = 100/10 = 10, could be used. For more information about this heuristic there is an interesting topic in stackoverflow to search.
You can use more than one cores to speed up execution time. However, this only pays off when working with large datasets. For small datasets, like in our example
ncore = 1 is recommended.
You are now ready to apply the
MMPC() algorithm and explore your very own dataset! The
MXM is under intensive development and will be regularly updated with new functionalities. For questions, algorithm requests and suggestions, do not hesitate to contact us at [email protected].
Lagani, V., Athineou, G., Farcomeni, A., Tsagris, M. & Tsamardinos, I. Feature Selection with the R Package MXM: Discovering Statistically Equivalent Feature Subsets. J. Stat. Softw. 80, 1-25 (2017).
Borboudakis, G. & Tsamardinos, I. Forward-Backward Selection with Early Dropping. (2017).
Tsamardinos, Brown and Aliferis (2006). The max-min hill-climbing Bayesian network structure learning algorithm. Machine learning, 65(1), 31-78.
Brown, L. E., Tsamardinos, I., & Aliferis, C. F. (2004). A novel algorithm for scalable and accurate Bayesian network learning. Medinfo, 711-715.
Tsamardinos, I., Aliferis, C. F., & Statnikov, A. (2003). Time and sample efficient discovery of Markov blankets and direct causal relations. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 673-678). ACM.
Statnikov, A. R., Tsamardinos, I., & Aliferis, C. F. (2003). An Algorithm For Generation of Large Bayesian Networks.
## R version 4.0.3 (2020-10-10) ## Platform: x86_64-w64-mingw32/x64 (64-bit) ## Running under: Windows 10 x64 (build 19041) ## ## Matrix products: default ## ## locale: ##  LC_COLLATE=C LC_CTYPE=Greek_Greece.1253 ##  LC_MONETARY=Greek_Greece.1253 LC_NUMERIC=C ##  LC_TIME=Greek_Greece.1253 ## system code page: 1252 ## ## attached base packages: ##  stats graphics grDevices utils datasets methods base ## ## other attached packages: ##  MXM_1.5.1 ## ## loaded via a namespace (and not attached): ##  tidyr_1.1.2 foreach_1.5.1 jsonlite_1.7.2 ##  splines_4.0.3 R.utils_2.10.1 Formula_1.2-4 ##  ucminf_1.1-4 statmod_1.4.35 latticeExtra_0.6-29 ##  slam_0.1-48 numDeriv_2016.8-1.1 pillar_1.4.7 ##  backports_1.2.0 lattice_0.20-41 quantreg_5.75 ##  glue_1.4.2 digest_0.6.27 RColorBrewer_1.1-2 ##  checkmate_2.0.0 minqa_1.2.4 colorspace_2.0-0 ##  htmltools_0.5.0 Matrix_1.2-18 R.oo_1.24.0 ##  Rfast2_0.0.8 conquer_1.0.2 pkgconfig_2.0.3 ##  broom_0.7.3 SparseM_1.78 relations_0.6-9 ##  purrr_0.3.4 scales_1.1.1 RANN_2.6.1 ##  jpeg_0.1-8.1 lme4_1.1-26 MatrixModels_0.4-1 ##  htmlTable_2.1.0 tibble_3.0.4 generics_0.1.0 ##  ggplot2_3.3.2 ellipsis_0.3.1 geepack_1.3-2 ##  nnet_7.3-14 survival_3.2-7 magrittr_2.0.1 ##  crayon_1.3.4 evaluate_0.14 ordinal_2019.12-10 ##  R.methodsS3_1.8.1 doParallel_1.0.16 R.cache_0.14.0 ##  nlme_3.1-149 MASS_7.3-53 foreign_0.8-80 ##  R.rsp_0.44.0 RcppZiggurat_0.1.6 tools_4.0.3 ##  coxme_2.2-16 data.table_1.13.4 energy_1.7-7 ##  lifecycle_0.2.0 matrixStats_0.57.0 stringr_1.4.0 ##  munsell_0.5.0 cluster_2.1.0 Rfast_2.0.1 ##  compiler_4.0.3 rlang_0.4.9 grid_4.0.3 ##  nloptr_188.8.131.52 iterators_1.0.13 rstudioapi_0.13 ##  bigmemory.sri_0.1.3 htmlwidgets_1.5.3 visNetwork_2.0.9 ##  base64enc_0.1-3 boot_1.3-25 codetools_0.2-16 ##  gtable_0.3.0 sets_1.0-18 bigmemory_4.5.36 ##  R6_2.5.0 gridExtra_2.3 knitr_1.30 ##  dplyr_1.0.2 bdsmatrix_1.3-4 Hmisc_4.4-2 ##  stringi_1.5.3 parallel_4.0.3 Rcpp_1.0.5 ##  vctrs_0.3.6 rpart_4.1-15 png_0.1-7 ##  tidyselect_1.1.0 xfun_0.19