Speakers

Jun Liu, Harvard University

Biography: Jun Liu (Chinese: 刘军) is Professor of Statistics at Harvard University, with a joint appointment in the Harvard School of Public Health. Dr. Liu received his BS degree in mathematics in 1985 from Peking University and his Ph.D. in statistics in 1991 from the University of Chicago. He held Assistant, Associate, and full Professor positions at Stanford University from 1994 to 2003. Dr. Liu received the NSF CAREER Award in 1995 and the Mitchell Award in 2000. He was selected as a Medallion Lecturer in 2002, a Bernoulli Lecturer in 2004, and a Kuwait Lecturer of Cambridge University in 2008; and was elected Fellow of the Institute of Mathematical Statistics and Fellow of the American Statistical Association in 2004 and 2005, respectively. In 2002, he won the prestigious COPSS Presidents' Award (given annually to one individual under age 40). In 2010, he was awarded the Morningside Gold Medal in Applied Mathematics (given once every 3 years to an individual of Chinese descent under age 45). He was honored by the International Chinese Statistical Association with the Outstanding Achievement Award in 2012 and the Pao-Lu Hsu Award (given once every 3 years) in 2016. In 2017, he was recognized with the Jerome Sacks Award for Outstanding Cross-Disciplinary Research.

Dr. Liu and his collaborators introduced the statistical missing data formulation and Gibbs sampling strategies for biological sequence motif analysis in the early 1990s. The resulting algorithms for protein sequence alignment, gene regulation analysis, and genetic studies have been adopted by many researchers as standard computational biology tools. Dr. Liu has made fundamental contributions to statistical computing and Bayesian modeling. He pioneered sequential Monte Carlo (SMC) methods and invented novel Markov chain Monte Carlo (MCMC) techniques. His theoretical and methodological studies of SMC and MCMC algorithms have had a broad impact in many areas. Dr. Liu has also pioneered novel Bayesian modeling techniques for discovering nonlinear and interactive effects in high-dimensional data and led the development of theory and methods for sufficient dimension reduction in high dimensions. Dr. Liu has served on numerous government grant review panels and on the editorial boards of leading statistical journals, including the co-editorship of JASA from 2011 to 2014. Dr. Liu has published more than 270 research articles in leading scientific journals and books, graduated more than 35 PhD students, and mentored 30 postdoctoral fellows.


Title: Data Splitting for Linear and Generalized Linear Model Selection With FDR Control

Abstract: Simultaneously finding multiple influential variables and controlling the false discovery rate (FDR) for statistical and machine learning models has recently attracted renewed interest. A classical statistical idea is to introduce perturbations and examine their impacts on a statistical procedure. Here we explore the use of data splitting (DS) for controlling the FDR in learning linear, generalized linear, and graphical models. Our proposed DS procedure simply splits the data into two halves at random and computes a statistic reflecting the consistency of the two sets of parameter estimates (e.g., regression coefficients). FDR control can be achieved by taking advantage of this statistic, which has the property that, for any null feature, its sampling distribution is symmetric about 0. Furthermore, by repeated sample splitting, we propose Multiple Data Splitting (MDS) to stabilize the selection result and boost the power. Interestingly, MDS not only helps overcome the power loss caused by DS while keeping the FDR under control, but also results in a lower variance for the estimated FDR compared with all other considered methods. DS and MDS are straightforward conceptually, easy to implement algorithmically, and efficient computationally. Simulation results as well as a real data application show that both DS and MDS control the FDR well, and MDS is often the most powerful among all methods under consideration, especially when the signals are weak and the correlations or partial correlations among the features are high. Our preliminary tests on nonlinear models such as generalized linear models and neural networks also show promise.
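The DS procedure described above can be sketched in a few lines of code. The following is a minimal illustration, not the authors' implementation: it uses ordinary least squares on each half (the paper's framework covers richer estimators), a sign-based "mirror" statistic as one concrete choice of consistency statistic that is symmetric about 0 under the null, and a data-driven threshold targeting FDR level q. All variable names and the simulated data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, s, q = 1000, 50, 5, 0.1          # samples, features, true signals, FDR target

# Simulated linear model: first s coefficients are nonzero (null features = rest).
beta = np.zeros(p)
beta[:s] = 1.0
X = rng.standard_normal((n, p))
y = X @ beta + rng.standard_normal(n)

# Step 1: split the data into two halves at random.
idx = rng.permutation(n)
h1, h2 = idx[: n // 2], idx[n // 2 :]

# Step 2: estimate coefficients independently on each half (OLS here for simplicity).
b1 = np.linalg.lstsq(X[h1], y[h1], rcond=None)[0]
b2 = np.linalg.lstsq(X[h2], y[h2], rcond=None)[0]

# Step 3: consistency ("mirror") statistic: large and positive when the two
# estimates agree in sign and magnitude; symmetric about 0 for null features.
M = np.sign(b1 * b2) * (np.abs(b1) + np.abs(b2))

# Step 4: choose the smallest threshold whose estimated FDP is below q,
# using the negative tail to estimate the number of false positives.
tau = np.inf
for t in np.sort(np.abs(M)):
    fdp_hat = (M <= -t).sum() / max((M >= t).sum(), 1)
    if fdp_hat <= q:
        tau = t
        break

selected = np.where(M >= tau)[0]
print("selected features:", selected)
```

With strong simulated signals the selected set recovers the true features; MDS would repeat steps 1-4 over many random splits and aggregate (e.g., via feature inclusion rates) to stabilize the selection and regain the power lost to halving the sample.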

The presentation is based on joint work with Chenguang Dai, Buyu Lin, and Xin Xing.