收稿日期: 2016-06-27
网络出版日期: 2016-11-29
基金资助
国家自然科学基金重点项目(61332006), 上海市基金(13ZR1413200)
Research and implementation of transactional real-time data ingestion technology without blocking
Received date: 2016-06-27
Online published: 2016-11-29
伴随着大数据时代来临, 传统数据库系统已逐渐无法应对海量数据处理带来的挑战, 而分布式数据库系统得到了越来越多的部署和应用. 分布式数据库系统部署数据于多台机器上, 利用大规模并行计算技术实现了对海量数据的存储、管理和分析. 但针对金融领域严苛的事务型实时数据注入需求, 现有分布式数据库系统对其支持有限, 其主要原因在于利用锁和两阶段提交等方式实现分布式事务处理, 无法做到非阻塞式数据注入, 极大地影响了数据注入的性能. 华东师范大学数据科学与工程研究院自主研发的分布式内存数据库系统-----CLAIMS, 已能提供面向关系型数据集的实时数据分析服务, 但尚不能支持实时数据注入. 针对上述实时数据注入的问题, 本文重点分析了现有数据注入技术和基于分布式事务处理的实现方式, 设计了面向元数据的集中式事务处理策略, 利用无锁编程技术, 实现了支持分布式事务的高性能实时数据注入框架, 并通过热备机制实现系统的高可用性. 上述框架在 CLAIMS 系统中的实现, 经充分实验表明: 该框架能够实现高通量的事务型实时数据注入, 同时支持低延时的实时数据查询.
关键词: 分布式数据库; 实时数据注入; 事务; CLAIMS
余 楷 , 李志方 , 周敏奇 , 周傲英 . 非阻塞事务型实时数据注入技术研究与实现[J]. 华东师范大学学报(自然科学版), 2016 , 2016(5) : 131 -143 . DOI: 10.3969/j.issn.1000-5641.2016.05.015
With the advent of big data era, traditional database systems are facing difficulties in satisfying the new challenges brought by massive data processing, while distributed database systems have been deployed widely in real applications. Distributed database systems partitioned and the dispatched the data across machines under a designed scheme and analyzed all the massive data in massive parallel manner. In facing of the requirements of the transactional real-time data ingestion from financial field, distributed database systems are ineffective and inefficient due to their implementation of the distributed transaction processing based on the lock and two-phase commit, which lead to the impossibility of non-blocking data ingestion. CLAIMS is a distributed in-memory database system designed and implemented by Institute for Data Science and Engineering of ECNU. It supports real-time data analysis towards relational data set but is incapable of real-time data ingestion. To address these problems, we analyzed data ingestion technology and distributed transaction processing algorithms first, and proposed to mimic the transactional data ingestion in the distributed environment with the centralized transaction processing based on meta data, and eventually achieved the real-time data ingestion with high availability and without blocking. The experiment results with the implementation of the proposed algorithms in CLAIMS proved that the proposed framework could achieve high throughput transactional real-time data ingestion as well as low latency real-time
query processing.
[ 1 ] DEAN J, GHEMAWAT S. MapReduce: Simplified data processing on large clusters [J]. Communications of the ACM, 2008, 51(1): 107-113.
[ 2 ] ZAHARIA M, CHOWDHURY M, FRANKLIN M J, et al. Spark: Cluster computing with working sets [C]//Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing. Berkeley: USENIX Association, 2010: 10.
[ 3 ] SHVACHKO K, KUANG H, RADIA S, et al. The hadoop distributed file system [C]//Proceedings of IEEE Conference on MSST. 2010: 1-10.
[ 4 ] 胡健, 和轶东. SAP内存计算------HANA [M]. 北京: 清华大学出版社, 2013.
[ 5 ] FARBER F, CHA S K, PRIMSCH J, et al. SAP HANA database: Data management for modern business applications [J]. ACM Sigmod Record, 2012, 40(4): 45-51.
[ 6 ] GLIGOR G, TEODORU S. Oracle exalytics: Engineered for speed-of-thought analytics [J]. Database Systems Journal, 2011, 2(4): 3-8.
[ 7 ] WANG L, ZHOU M Q, ZHANG Z J, et al. Elastic pipelining in in-memory DataBase cluster [R]. 2016.
[ 8 ] TRAVERSO M. Presto: Interacting with petabytes of data at Facebook [EB/OL].(2013-11-07)[2016-06-10]. http://www.facebook.com/notes/facebook-engineering/presto-interacting-with-petabytes-of-data-at-facebook/10151786197628920.
[ 9 ] ARMBRUST M, XIN R S, LIAN C, et al. Spark SQL: Relational data processing in spark [C]//Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM, 2015: 1383-1394.
[10] YANG F, TSCHETTER E, LEAUTE X, et al. Druid: A real-time analytical data store [C]//Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. 2014.
[11] GARCIA-MOLINA H, ULLMAN J D, WIDOM J. Database System Implementation [M]. Upper Saddle River, NJ: Prentice Hall, 2000.
[12] NAGA P N. Real-time Analytics at Massive Scale with Pinot [EB/OL]. [2016-06-10]. https://engineering.linkedin.com/analytics/real-time-analytics-massive-scale-pinot.
[13] KREPS J, NARKHEDE N, RAO J, et al. Kafka: A distributed messaging system for log processing [C]//Proceedings of the NetDB. 2011: 1-7.
[14] LAMB A, FULLER M, VARADARAJAN R, et al. The vertica analytic database: C-store 7 years later [C]//Proceedings of the VLDB Endowment. 2012: 1790-1801.
[15] CHANG L, WANG Z, MA T, et al. Hawq: A massively parallel processing sql engine in hadoop [C]//Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. 2014.
[16] STONEBRAKER M, WEISBERG A. The VoltDB main memory DBMS [J]. IEEE Data Eng Bull, 2013: 21-27.
[17] BRYANT R E, O′HALLARON D R. 深入理解计算机系统[M], 北京:机械工业出版社,2013.
[18] ESWARAN K P, GRAY J N, LORIE R A, et al. The notions of consistency and predicate locks in a database system [J]. Communications of the ACM, 1976, 19(11): 624-633.
[19] STONEBRAKER M. One Size Fits None-(Everything You Learned in Your DBMS Class is Wrong) [R/OL]. (2013-05-30)[2016-07-01]. http://slideshot.epfl.ch/talks/166.
[20] WEIKUM G, VOSSEN G. Transactional Information Systems: Theory, Algorithms, and the Practice of Concurrency Control and Recovery [M]. San Francisco: Morgan Kaufmann Publishers, 2002.
[21] DIACONU C, FREEDMAN C, ISMERT E, et al. Hekaton: SQL server’s memory-optimized OLTP engine [C]//Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. 2013.
[22] MICHAEL M M. High performance dynamic lock-free hash tables and list-based sets [C]//Proceedings of the 14th Annual ACM Symposium on Parallel Algorithms and Architectures. 2002: 73-82.
[23] LAMPSON BW, STURGIS H E. Crash Recovery in a Distributed Data Storage System [R]. Palo Alto, California: Xerox Palo Alto Research Center, 1979.
[24] SKEEN D. Nonblocking commit protocols [C]//Proceedings of the 1981 ACM SIGMOD International Conference on Management of Data. 1981.
[25] HAN J, HAIHONG E, LE G, et al. Survey on NoSQL database [C]//Proceedings of the 2011 6th International Conference on Pervasive Computing and Applications. 2011: 363-366.
[26] O’NEIL E J, O’NEIL P E, WEIKUM G. The LRU-K page replacement algorithm for database disk buffering [C]//Proceedings of the ACM SIGMOD International Conference on Management of Data. 1993: 297-306.
/
〈 |
|
〉 |