Review Articles

Reinforced variable selection

Yuan Le,

School of Mathematics and Statistics, Fuzhou University, Fuzhou, People's Republic of China

Yang Bai,

School of Statistics and Data Science, Shanghai University of Finance and Economics, Shanghai, People's Republic of China

statbyang@mail.shufe.edu.cn

Fan Zhou

School of Statistics and Data Science, Shanghai University of Finance and Economics, Shanghai, People's Republic of China

Received 01 Nov. 2024, Accepted 01 Jun. 2025, Published online: 20 Jun. 2025
Abstract

Variable selection identifies, among all possible subsets, the best subset of covariates for building a prediction model. In this paper, we propose a novel reinforced variable selection method, called 'Actor-Critic-Predictor'. The actor takes an action that chooses variables, and the predictor evaluates the action through a well-designed reward function, while the critic learns the reward baseline. We model the variable selection process as a multi-armed bandit and update the subset of selected variables using a natural policy gradient algorithm. We also provide a theoretical framework analysing how different sources of error affect the performance of our method. Extensive experiments on synthetic and real datasets show that the proposed framework is easy to implement and outperforms classical variable selection methods in a wide range of scenarios.
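The abstract describes the actor-critic-predictor loop only at a high level. The following is a minimal sketch of such a loop, assuming a Bernoulli actor over covariates, a least-squares predictor scored on held-out data, and a running-average baseline standing in for the learned critic; a plain REINFORCE update replaces the paper's natural policy gradient, and all names, data, and hyper-parameters are illustrative rather than taken from the article.

    # Minimal sketch of an actor-critic-predictor style variable-selection loop.
    # The running-average baseline and the plain REINFORCE update below are
    # simplifying assumptions; the paper uses a learned critic and a natural
    # policy gradient.
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)

    # Synthetic data: only the first 3 of 10 covariates matter.
    n, p = 500, 10
    X = rng.normal(size=(n, p))
    y = X[:, 0] - 2 * X[:, 1] + 0.5 * X[:, 2] + 0.1 * rng.normal(size=n)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

    logits = np.zeros(p)                 # actor parameters: one logit per covariate
    baseline, lr, beta = 0.0, 0.5, 0.9   # illustrative hyper-parameters

    for step in range(300):
        probs = 1.0 / (1.0 + np.exp(-logits))   # selection probabilities
        action = rng.random(p) < probs          # actor: sample a subset of covariates
        if not action.any():                    # avoid an empty subset
            action[rng.integers(p)] = True

        # Predictor: fit on the chosen covariates and score on held-out data.
        model = LinearRegression().fit(X_tr[:, action], y_tr)
        mse = np.mean((model.predict(X_val[:, action]) - y_val) ** 2)
        reward = -mse - 0.01 * action.sum()     # penalise large subsets

        # Critic stand-in: running-average baseline reduces gradient variance.
        baseline = beta * baseline + (1 - beta) * reward
        advantage = reward - baseline

        # REINFORCE update for independent Bernoulli selection probabilities.
        grad = (action.astype(float) - probs) * advantage
        logits += lr * grad

    print("selection probabilities:", np.round(1.0 / (1.0 + np.exp(-logits)), 2))

Under this toy setup the selection probabilities of the informative covariates should drift toward one while the subset-size penalty pushes the irrelevant ones toward zero; the penalty term is only a stand-in for the reward design discussed in the paper.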


To cite this article: Yuan Le, Yang Bai & Fan Zhou (20 Jun 2025): Reinforced variable selection, Statistical Theory and Related Fields, DOI: 10.1080/24754269.2025.2516346

To link to this article: https://doi.org/10.1080/24754269.2025.2516346