Review Articles

Semiparametric Bayesian analysis of high-dimensional censored outcome data

Chetkar Jha,

Department of Statistics, University of Missouri, Columbia, MO, USA

Yi Li,

Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA

Subharup Guha

Department of Statistics, University of Missouri, Columbia, MO, USA

Pages 194-204 | Received 21 May 2017, Accepted 21 Oct 2017, Published online: 10 Nov 2017


The Surveillance, Epidemiology, and End Results (SEER) cancer database contains survival data for individuals in the US diagnosed with cancer. Semiparametric Bayesian methods are computationally expensive to fit to datasets of this size. This paper develops a cost-effective Markov chain Monte Carlo strategy for censored outcomes and applies it to a semiparametric Bayesian analysis of SEER data from New Mexico. We use an accelerated failure time model with Dirichlet process random effects for inter-subject variation and intrinsic conditionally autoregressive random effects for spatial correlation. The results offer insights into how breast cancer mortality rates differ across ethnic groups and tumor grades, and into the spatial effects of counties.
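To illustrate how right censoring enters an accelerated failure time likelihood, the following minimal sketch may help. It is not the paper's model or sampler: it assumes normal errors on the log-time scale and treats each subject's random effects (for example, a subject-level effect plus a county-level effect) as a known additive offset. The function name `aft_loglik` and all argument names are hypothetical.

```python
import math


def normal_logpdf(z):
    """Log density of a standard normal at z."""
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)


def normal_logsf(z):
    """Log of P(Z > z) for a standard normal, via the complementary error function.

    Note: for very large z, erfc underflows to 0 and math.log raises ValueError;
    a production implementation would need a more stable tail approximation.
    """
    return math.log(0.5 * math.erfc(z / math.sqrt(2.0)))


def aft_loglik(times, events, X, beta, offsets, sigma):
    """Log-likelihood of a log-normal AFT model with right censoring.

    Assumed model (illustrative only):
        log T_i = x_i' beta + offset_i + sigma * eps_i,   eps_i ~ N(0, 1)

    times   : observed times y_i = min(T_i, C_i), all positive
    events  : 1 if the event was observed, 0 if right-censored
    X       : list of covariate rows x_i
    beta    : regression coefficients
    offsets : per-subject additive effects, treated here as known
    sigma   : error scale on the log-time scale
    """
    total = 0.0
    for y, d, x, u in zip(times, events, X, offsets):
        mu = sum(b * xi for b, xi in zip(beta, x)) + u
        z = (math.log(y) - mu) / sigma
        if d == 1:
            # Observed event: density of T at y (includes the 1/y Jacobian).
            total += normal_logpdf(z) - math.log(sigma) - math.log(y)
        else:
            # Right-censored: only know that T exceeded y.
            total += normal_logsf(z)
    return total
```

An observed event contributes a density term, while a censored observation contributes only the probability of surviving past the censoring time. This is why censored outcomes call for special handling in posterior simulation.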

