Review Articles

Multi-category diagnostic accuracy based on logistic regression

Jialiang Li

Department of Statistics and Applied Probability, Duke-NUS Graduate Medical School, Singapore Eye Research Institute, National University of Singapore, Singapore

Jason P. Fine

Department of Biostatistics, University of North Carolina, Chapel Hill, NC, USA

Pages 143-158 | Received 01 Mar. 2017, Accepted 20 Mar. 2017, Published online: 11 May 2017


We provide a detailed review of the statistical analysis of diagnostic accuracy in multi-category classification tasks. For qualitative response variables with more than two categories, many traditional accuracy measures, such as sensitivity, specificity and the area under the ROC curve, are no longer applicable. New diagnostic accuracy measures have been introduced in the recent medical research literature. In this paper, we review the important statistical concepts for multi-category classification accuracy and demonstrate their utility with real medical examples. We offer problem-based R code to illustrate how to perform these statistical computations step by step. We expect that such analysis tools will become more familiar to practitioners and find broader application in biostatistics. Our program can be adapted to many classifiers, among which logistic regression may be the most popular. We therefore base our discussion and illustrations entirely on logistic regression.
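As a rough illustration of the setting described above, the sketch below scores a three-category multinomial logistic model by the proportion of subjects assigned to their true category. It is written in Python rather than the paper's R and is not the authors' program. The coefficients, marker values, labels and function names are all hypothetical, and the proportion of correct classifications is used only as one simple multi-category summary.

```python
import math


def softmax_probs(x, coefs):
    """Class probabilities from a multinomial logistic model for one subject.

    coefs[k] is a hypothetical (intercept, slope) pair for category k;
    category 0 is the reference, so its linear predictor is fixed at 0.
    """
    scores = [b0 + b1 * x for b0, b1 in coefs]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def correct_classification_proportion(markers, labels, coefs):
    """Share of subjects whose most probable category equals the true one."""
    hits = 0
    for x, y in zip(markers, labels):
        probs = softmax_probs(x, coefs)
        predicted = probs.index(max(probs))
        hits += int(predicted == y)
    return hits / len(labels)


# Hypothetical fitted coefficients and toy data (assumptions, not real results).
COEFS = [(0.0, 0.0), (-2.0, 2.0), (-6.0, 4.0)]
markers = [0.2, 0.5, 0.9, 1.2, 1.6, 1.9, 2.5, 3.0]
labels = [0, 0, 1, 1, 1, 2, 2, 2]

ccp = correct_classification_proportion(markers, labels, COEFS)
print(f"Proportion correctly classified: {ccp:.2f}")
```

With these made-up inputs, two of the eight subjects (marker values 0.9 and 1.9) are assigned to the wrong category. In practice the coefficients would come from a fitted model, and the summary would be estimated on data not used for fitting.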

