Review Articles

Bayesian functional enrichment analysis for the Reactome database

Jing Cao

Department of Statistical Science, Southern Methodist University, Dallas, TX, U.S.A.

jcao@smu.edu

Pages 185-193 | Received 10 Apr. 2017, Accepted 05 Aug. 2017, Published online: 17 Oct. 2017

ABSTRACT

The first step in the analysis of high-throughput experiment results is often to identify genes or proteins with certain characteristics, such as differentially expressed (DE) genes. To gain more insight into the underlying biology, functional enrichment analysis is then conducted to provide a functional interpretation of the identified genes or proteins. The hypergeometric P value has been widely used to investigate whether genes from predefined functional terms, e.g., those in the Reactome database, are enriched among the DE genes. However, the hypergeometric P value has two main limitations: (1) it is computed independently for each term, thus neglecting the biological dependence among terms; and (2) it is subject to a size constraint that leads to a tendency to select less-specific terms. In this paper, a Bayesian approach is proposed to overcome these limitations by incorporating the interconnected dependence structure of biological functions in the Reactome database through a conditional autoregressive (CAR) prior in a Bayesian hierarchical logistic model. Inference on functional enrichment is then based on posterior probabilities, which are not subject to the size constraint. This method can detect moderate but consistent enrichment signals and identify sets of closely related and biologically meaningful functional terms rather than isolated terms. The performance of the Bayesian method is demonstrated via a simulation study and a real data application.
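For readers unfamiliar with the baseline that the abstract criticizes, the following is a minimal sketch of a per-term hypergeometric enrichment P value, written as the upper-tail probability of the observed overlap between a term and the DE gene list. It is not taken from the paper: the function name, argument names, and all gene counts are hypothetical and chosen only for illustration. The last part shows one way to read the size constraint: with the same proportion of DE genes, a larger and typically less-specific term receives a much smaller P value.

```python
from math import comb


def hypergeom_enrichment_p(n_total, n_term, n_de, n_overlap):
    """Upper-tail hypergeometric P value, P(X >= n_overlap).

    n_total   -- number of genes in the background (hypothetical name)
    n_term    -- number of background genes annotated to the term
    n_de      -- number of DE genes in the background
    n_overlap -- number of DE genes annotated to the term
    """
    if not (0 <= n_term <= n_total and 0 <= n_de <= n_total):
        raise ValueError("n_term and n_de must lie between 0 and n_total")
    largest = min(n_term, n_de)          # largest possible overlap
    smallest = max(n_overlap, 0)         # start of the upper tail
    denom = comb(n_total, n_de)          # all ways to choose the DE set
    # comb() returns 0 for impossible configurations, so no extra
    # lower-bound check on the overlap is needed.
    tail = sum(
        comb(n_term, x) * comb(n_total - n_term, n_de - x)
        for x in range(smallest, largest + 1)
    )
    return tail / denom


# Illustrative (made-up) counts: 1000 background genes, 100 of them DE.
# Both terms below have 30% of their genes DE, versus 10% in the background.
p_small_term = hypergeom_enrichment_p(1000, 10, 100, 3)    # specific term
p_large_term = hypergeom_enrichment_p(1000, 100, 100, 30)  # broad term

print(f"small term (3 of 10 DE):    P = {p_small_term:.3g}")
print(f"large term (30 of 100 DE):  P = {p_large_term:.3g}")
```

Each call treats its term in isolation, which is the first limitation named above; the comparison between the two terms illustrates the second.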
