Review Articles

Reinforced variable selection

Yuan Le,

School of Mathematics and Statistics, Fuzhou University, Fuzhou, People's Republic of China

Yang Bai,

School of Statistics and Data Science, Shanghai University of Finance and Economics, Shanghai, People's Republic of China

statbyang@mail.shufe.edu.cn

Fan Zhou

School of Statistics and Data Science, Shanghai University of Finance and Economics, Shanghai, People's Republic of China

Received 01 Nov. 2024, Accepted 01 Jun. 2025, Published online: 20 Jun. 2025
Abstract

Variable selection identifies, among all possible subsets, the best subset of covariates for building a prediction model. In this paper, we propose a novel reinforced variable selection method, called 'Actor-Critic-Predictor'. The actor takes an action that chooses variables, and the predictor evaluates the action through a well-designed reward function, while the critic learns the reward baseline. We model the variable selection process as a multi-armed bandit and update the subset of selected variables using a natural policy gradient algorithm. We also provide a theoretical framework analysing how different sources of error affect the performance of our method. Extensive experiments on synthetic and real datasets show that the proposed framework is easy to implement and outperforms classical variable selection methods in a wide range of scenarios.
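The abstract describes the actor-critic-predictor loop only at a high level. The following is a minimal sketch of such a loop, assuming a Bernoulli actor over covariates, a least-squares predictor scored on held-out data, and a running-average baseline standing in for the learned critic; a plain REINFORCE update replaces the paper's natural policy gradient, and all names, data, and hyper-parameters are illustrative rather than taken from the article.

    # Minimal sketch of an actor-critic-predictor style variable-selection loop.
    # The running-average baseline and the plain REINFORCE update below are
    # simplifying assumptions; the paper uses a learned critic and a natural
    # policy gradient.
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)

    # Synthetic data: only the first 3 of 10 covariates matter.
    n, p = 500, 10
    X = rng.normal(size=(n, p))
    y = X[:, 0] - 2 * X[:, 1] + 0.5 * X[:, 2] + 0.1 * rng.normal(size=n)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

    logits = np.zeros(p)                 # actor parameters: one logit per covariate
    baseline, lr, beta = 0.0, 0.5, 0.9   # illustrative hyper-parameters

    for step in range(300):
        probs = 1.0 / (1.0 + np.exp(-logits))   # selection probabilities
        action = rng.random(p) < probs          # actor: sample a subset of covariates
        if not action.any():                    # avoid an empty subset
            action[rng.integers(p)] = True

        # Predictor: fit on the chosen covariates and score on held-out data.
        model = LinearRegression().fit(X_tr[:, action], y_tr)
        mse = np.mean((model.predict(X_val[:, action]) - y_val) ** 2)
        reward = -mse - 0.01 * action.sum()     # penalise large subsets

        # Critic stand-in: running-average baseline reduces gradient variance.
        baseline = beta * baseline + (1 - beta) * reward
        advantage = reward - baseline

        # REINFORCE update for independent Bernoulli selection probabilities.
        grad = (action.astype(float) - probs) * advantage
        logits += lr * grad

    print("selection probabilities:", np.round(1.0 / (1.0 + np.exp(-logits)), 2))

Under this toy setup the selection probabilities of the informative covariates should drift toward one while the subset-size penalty pushes the irrelevant ones toward zero; the penalty term is only a stand-in for the reward design discussed in the paper.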


To cite this article: Yuan Le, Yang Bai & Fan Zhou (20 Jun 2025): Reinforced variable selection, Statistical Theory and Related Fields, DOI: 10.1080/24754269.2025.2516346

To link to this article: https://doi.org/10.1080/24754269.2025.2516346