Adaptive subspace methods for high-dimensional variable selection

  • Adaptive Subspace Methoden für hoch-dimensionale Variablenselektion

Staerk, Christian; Kateri, Maria (Thesis advisor); Ntzoufras, Ioannis (Thesis advisor); Cramer, Erhard (Thesis advisor)

Aachen (2018)
Dissertation / PhD Thesis

Dissertation, RWTH Aachen University, 2018

Abstract

Due to recent advancements in fields such as information technology and genomics, nowadays one commonly faces high-dimensional data where the number of explanatory variables is possibly much larger than the number of observations. In such situations one is particularly interested in variable selection, meaning that one aims at identifying a sparse model with a relatively small subset of variables that fits and ideally explains the observed data well. This thesis deals with the variable selection problem in the setting of high-dimensional generalized linear models (GLMs). While many variable selection methods like the Lasso are based on solving convex $\ell_1$-type relaxations of the original problem, a main motive of this work is the desire to provide solutions to generally NP-hard $\ell_0$-regularized problems induced by model selection criteria such as the Extended Bayesian Information Criterion (EBIC). For this purpose, the Adaptive Subspace (AdaSub) method is proposed which is based on the idea of adaptively solving several low-dimensional sub-problems of the original high-dimensional problem. AdaSub is a stochastic algorithm which sequentially adapts the sampling probabilities of the individual variables based on their currently estimated "importance". It is shown that the updating scheme of AdaSub can be motivated in a Bayesian way and that the method "converges correctly" against the best model according to the employed criterion, provided that the so-called ordered importance property (OIP) is satisfied. Furthermore, the variable selection consistency of AdaSub is proved under suitable conditions. Since solving the sampled sub-problems can be computationally expensive for GLMs different than the normal linear model, "greedy" modifications of AdaSub are introduced which provide approximate solutions to the sub-problems. It is argued that BackAdaSub, a version of AdaSub based on Backward Stepwise Selection, may be used as an efficient surrogate algorithm. The "correct convergence" of BackAdaSub can be guaranteed under the modified ordered importance property (MOIP), which is a stronger condition than the original OIP. The performance of AdaSub and BackAdaSub in comparison to other prominent competitors such as the Lasso, the Adaptive Lasso, the SCAD and Stability Selection is investigated via various simulated and real data examples in the framework of linear and logistic regression models. Finally, a Metropolized version of AdaSub, called the MAdaSub algorithm, is proposed for sampling from posterior model distributions in the Bayesian variable selection context. MAdaSub is an adaptive Markov Chain Monte Carlo (MCMC) algorithm which sequentially adjusts the proposal distribution based on the information from the previously sampled models. It is shown that the MAdaSub algorithm is ergodic despite its continuing adaptation, i.e. "in the limit" it samples from the correct target distribution. Through simulated and real data examples it is demonstrated that MAdaSub can provide stable estimates of posterior marginal inclusion probabilities even for very high-dimensional and multimodal posterior model distributions.