In large-scale distributed databases, software and hardware failures such as network partitions, message loss, and node crashes are unavoidable. To improve the reliability of a distributed database and verify the correctness of its fault-tolerance protocols, the system should be tested regularly with fault injection, i.e., faults deliberately triggered while the system is running. The space of possible fault combinations, however, is far too large to enumerate exhaustively. Existing testing methods fall into two classes: random fault combinations, which are simple to implement but offer no guarantee that all relevant combinations are explored, and fault combinations designed from expert analysis of the system, which test more thoroughly but do not generalize across systems. Taking lineage-driven fault injection (LDFI) as a research prototype, we implemented an automated fault-injection testing tool on top of the distributed database Cedar that is both complete and generally applicable. Experiments show that, with fewer test cases, the tool uncovers system bugs caused by compound fault combinations that random fault injection cannot detect, thereby improving confidence in the database.
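The core idea behind LDFI can be illustrated with a toy sketch: from a successful run, the tool extracts the lineage of the good outcome (the alternative sets of events, such as message deliveries, any one of which suffices to produce it), then searches for a fault set that cuts every such support. LDFI proper encodes this question as a Boolean formula for a SAT solver; the brute-force version below, with a hypothetical replication lineage, is only a minimal sketch of the same reasoning.

```python
from itertools import combinations

# Toy lineage for a replicated write that succeeded in a sample run
# (hypothetical example). Each inner set is one independent "support":
# a set of messages whose delivery alone suffices for the outcome.
lineage = [
    {"A->B", "A->C"},  # primary A replicates directly to B and C
    {"A->B", "B->C"},  # or: A sends to B, which relays to C
]

events = sorted(set().union(*lineage))


def candidate_faults(lineage, events, max_size=3):
    """Return the smallest fault sets that could break the outcome.

    A set of dropped messages can falsify the outcome only if it
    removes at least one event from every support. LDFI hands this
    hitting-set condition to a SAT solver; brute force suffices here.
    """
    for k in range(1, max_size + 1):
        found = [set(c) for c in combinations(events, k)
                 if all(set(c) & sup for sup in lineage)]
        if found:
            return found  # minimal-size candidates to inject next
    return []


print(candidate_faults(lineage, events))
```

Each candidate is then injected in a fresh run: if the outcome still holds, the new run's lineage refines the formula; if it fails, a fault-tolerance bug has been found. This backward search is why LDFI needs far fewer test cases than random injection.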