High-dimensional semi-supervised learning: in search of optimal inference for the mean
We provide a high-dimensional semi-supervised inference framework focused on the mean and variance of the response. Our data comprise an extensive set of observations of the covariate vectors and a much smaller set of labeled observations for which we observe both the response and the covariates. We allow the dimension of the covariates to be much larger than the sample size and impose weak conditions on the statistical form of the data. We provide new estimators of the mean and variance of the response that extend some of the recent results for low-dimensional models. In particular, in some cases we do not require consistent estimation of the functional form of the data. Together with estimates of the population mean and variance, we provide their asymptotic distributions and confidence intervals, and we showcase gains in efficiency compared to the sample mean and variance. Our procedure, with minor modifications, is then used to make contributions to inference about average treatment effects. We also investigate the robustness of estimation and coverage and showcase the wide applicability and generality of the proposed method.
with Yuqian Zhang, submitted
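As a rough illustration of the semi-supervised idea described in the abstract above, and not the estimator proposed in the paper, the following low-dimensional sketch fits a working regression on the labeled data and then averages its predictions over all available covariates, adding back the mean labeled residual. The function name `semi_supervised_mean`, the least-squares working model, and the simulated data are assumptions made purely for illustration; a high-dimensional version would presumably replace least squares with a regularized fit.

```python
import numpy as np


def semi_supervised_mean(X_lab, y_lab, X_unlab):
    """Regression-adjusted estimate of E[Y] (illustrative, low-dimensional).

    Assumption: a linear working model fitted by ordinary least squares.
    """
    # Fit least squares with an intercept on the labeled sample.
    A_lab = np.column_stack([np.ones(len(X_lab)), X_lab])
    coef, *_ = np.linalg.lstsq(A_lab, y_lab, rcond=None)

    # Average the fitted values over labeled and unlabeled covariates.
    X_all = np.vstack([X_lab, X_unlab])
    A_all = np.column_stack([np.ones(len(X_all)), X_all])
    fitted_mean = (A_all @ coef).mean()

    # Add the mean labeled residual; with an OLS intercept this is
    # numerically zero, but it matters for other working fits.
    residual_mean = (y_lab - A_lab @ coef).mean()
    return fitted_mean + residual_mean


# Hypothetical simulated data: 200 labeled and 5000 unlabeled observations.
rng = np.random.default_rng(0)
beta = np.array([2.0, -1.0, 0.5])
X_lab = rng.normal(size=(200, 3))
X_unlab = rng.normal(size=(5000, 3))
y_lab = 1.0 + X_lab @ beta + rng.normal(size=200)

print("sample mean of labeled y:", y_lab.mean())
print("semi-supervised estimate:", semi_supervised_mean(X_lab, y_lab, X_unlab))
```

When no unlabeled covariates are supplied, this sketch reduces to the labeled sample mean. Any gain over the sample mean comes from the extra covariate observations.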
Censored quantile regression forests 
Random forests are a powerful nonparametric regression method but are severely limited in their usage in the presence of randomly censored observations, and when naively applied they can exhibit poor predictive performance due to the incurred biases. Based on a local adaptive representation of random forests, we develop a regression adjustment for randomly censored regression quantile models. The regression adjustment is based on new estimating equations that adapt to censoring and reduce to the quantile score whenever the data do not exhibit censoring. The proposed procedure, named the censored quantile regression forest, allows us to estimate quantiles of time-to-event outcomes without any parametric modeling assumptions. We establish its consistency under mild model specifications. Numerical studies showcase a clear advantage of the proposed procedure.
with Alexander Hanbo Li, submitted
Confidence intervals for high-dimensional Cox model
The purpose of this paper is to construct confidence intervals for the regression coefficients in high-dimensional Cox proportional hazards regression models where the number of covariates may be larger than the sample size. Our de-biased estimator construction is similar to those in Zhang and Zhang (2014) and van de Geer et al. (2014), but the time-dependent covariates and censored risk sets introduce considerable additional challenges. Our theoretical results, which provide conditions under which our confidence intervals are asymptotically valid, are supported by extensive numerical experiments.
with Richard J. Samworth and Yi Yu, revision at Statistica Sinica
Testability of high-dimensional linear models with non-sparse structures
This paper studies hypothesis testing and confidence interval construction in high-dimensional linear models with possibly non-sparse structures. For a given component of the parameter vector, we show that the difficulty of the problem depends on the sparsity of the corresponding row of the precision matrix of the covariates, not the sparsity of the model itself. We develop new concepts of uniform and essentially uniform non-testability that allow the study of limitations of tests across a broad set of alternatives. Uniform non-testability identifies an extensive collection of alternatives such that the power of any test, against any alternative in this group, is asymptotically at most equal to the nominal size, whereas minimaxity shows the existence of one particularly "bad" alternative. Implications of the new constructions include new minimax testability results that, in sharp contrast to existing results, do not depend on the sparsity of the model parameters. We identify new trade-offs between testability and feature correlation. In particular, we show that in models with weak feature correlations the minimax lower bound can be attained by a confidence interval whose width has the parametric rate regardless of the size of the model sparsity.
with Jianqing Fan and Yinchu Zhu, revision at AOS 
Testing in high-dimensional linear mixed models
Many scientific and engineering challenges, ranging from pharmacokinetic drug dosage allocation and personalized medicine to marketing mix (4Ps) recommendations, require an understanding of unobserved heterogeneity in order to develop the best decision-making processes. In this paper, we develop a hypothesis test and the corresponding p-value for testing the significance of the homogeneous structure in linear mixed models. A robust moment-matching construction is used to create a test that adapts to the size of the model sparsity. When unobserved heterogeneity at a cluster level is constant, we show that our test is both consistent and unbiased even when the dimension of the model is extremely high. Our theoretical results rely on a new family of adaptive sparse estimators of the fixed effects that do not require consistent estimation of the random effects. Moreover, our inference results do not require consistent model selection. We show that moment matching can be extended to nonlinear mixed-effects models and to generalized linear mixed-effects models. In numerical and real data experiments, we find that the developed method is extremely accurate, that it adapts to the size of the underlying model, and that it is decidedly powerful in the presence of irrelevant covariates.
with Gerda Claeskens and Thomas Gueuning, revision requested by Journal of the American Statistical Association: T&M 
Fine-Gray competing risks model with high dimensional covariates: estimation and inference
The purpose of this paper is to construct confidence intervals for the regression coefficients in the Fine-Gray model for competing risks data with random censoring, where the number of covariates can be larger than the sample size. Despite strong motivation from biostatistics applications, the high-dimensional Fine-Gray model has attracted relatively little attention in the methodological or theoretical literature. We fill this gap by first proposing a consistent regularized estimator and then constructing confidence intervals based on the one-step bias-correcting estimator. We are able to generalize the partial likelihood approach for the Fine-Gray model under random censoring despite many technical difficulties. We lay down a methodological and theoretical framework for the one-step bias-correcting estimator with the partial likelihood, which does not have independent and identically distributed entries. Our theory also handles the approximation error from inverse probability weighting (IPW), for which we propose novel concentration results for time-dependent processes. In addition to the theoretical results and algorithms, we present extensive numerical experiments and an application to a study of non-cancer mortality among prostate cancer patients using the linked Medicare-SEER data.
with Ronghui Xu and Jue Hou, minor revision at the Electronic Journal of Statistics 
Breaking the curse of dimensionality
Models with many signals, that is, high-dimensional models, often impose structures on the signal strengths. The common assumption is that only a few signals are strong and most of the signals are zero or (collectively) close to zero. However, such a requirement might not be valid in many real-life applications. In this article, we are interested in conducting large-scale inference in models that might have signals of mixed strengths. The key challenge is that the signals that are not under testing might be collectively non-negligible (although individually small) and cannot be accurately learned. This article develops a new class of tests that arise from a moment-matching formulation. A virtue of these moment-matching statistics is their ability to borrow strength across features, adapt to the sparsity size, and adjust for testing a growing number of hypotheses. The GRoup-level Inference of Parameter (GRIP) test harvests effective sparsity structures with hypothesis formulation for an efficient multiple testing procedure. Simulated data showcase that GRIP's error control is far better than that of the alternative methods. We develop a minimax theory, demonstrating optimality of GRIP for a broad range of models, including those where the model is a mixture of sparse and high-dimensional dense signals.
with Yinchu Zhu, revision requested by the Journal of Machine Learning Research
A projection pursuit framework for testing

Comment on "High dimensional

Uniform inference for high-dimensional

Two-sample testing in non-sparse

Linear hypothesis testing in dense

High-dimensional inference in linear models:

Robust confidence intervals in high-dimensional

Boosting in the presence of outliers:

Robustness in sparse linear models:

Randomized Maximum Contrast Selection:

Structured Estimation in Non-Parametric Cox Model
In this paper, we study theoretical properties of the nonparametric Cox proportional hazards model in a high-dimensional non-asymptotic setting. We establish the finite sample oracle l2 bounds for a general class of group penalties that allow possible hierarchical and overlapping structures. We approximate the log partial likelihood with a quadratic functional and use truncation arguments to reduce the error. Unlike the existing literature, we exemplify differences between bounded and possibly unbounded nonparametric covariate effects. In particular, we show that bounded effects can lead to prediction bounds similar to the simple linear models, whereas unbounded effects can lead to larger prediction bounds. In both situations we do not assume that the true parameter is necessarily sparse. Lastly, we present new theoretical results for hierarchical and smoothed estimation in the nonparametric Cox model. We provide two examples of the proposed general framework: a Cox model with interactions and an ANOVA type Cox model.
with Rui Song, Electronic Journal of Statistics (2015), 9(1), p. 492-534
Cultivating Disaster Donors Using Data Analytics
Nonprofit organizations use direct-mail marketing to cultivate one-time donors and convert them into recurring contributors. Cultivated donors generate much more revenue than new donors, but also lapse with time, making it important to steadily draw in new cultivations. We propose a new empirical model based on importance subsample aggregation of a large number of penalized logistic regressions. We show via simulation that a simple design strategy based on these insights has potential to improve success rates from 5.4% to 8.1%.
with Ilya Ryzhov and Bin Han, Management Science (2016), 62(3), p. 849-866
Regularization for Cox's proportional

Composite Quasi-Likelihood
