Computer Science

Fault-tolerance in distributed in-memory database systems

  • ZHAO Zhen-hui ,
  • HUANG Cheng-shen ,
  • ZHOU Min-qi ,
  • ZHOU Ao-ying
  • Institute for Data Science and Engineering, East China Normal University, Shanghai 200062, China

Received date: 2016-06-27

  Online published: 2016-11-29

Funding

Key Program of the National Natural Science Foundation of China (61332006); Shanghai municipal fund (13ZR1413200)

Cite this article

ZHAO Zhen-hui, HUANG Cheng-shen, ZHOU Min-qi, ZHOU Ao-ying. Fault-tolerance in distributed in-memory database systems[J]. Journal of East China Normal University (Natural Science), 2016, 2016(5): 27-35. DOI: 10.3969/j.issn.1000-5641.2016.05.004

Abstract

In the big data era, distributed systems are widely deployed in enterprises. As the number of nodes grows, so does the probability of system failure, so a fault-tolerance mechanism is essential for improving the availability, reliability, and recoverability of a distributed system. CLAIMS is an in-memory database system for real-time analysis of real-time data, aimed at financial applications: it serves near-real-time query and analysis tasks while data is continuously ingested. This paper discusses the fault-tolerance mechanism in CLAIMS. A lease mechanism is used to quickly detect and mark abnormal nodes (Fail-fast). Once an abnormal node has been marked, the affected analysis tasks are restarted (Fail-over) and the global in-memory state of the abnormal node is recovered (Fail-back). Experimental results show that the proposed algorithms achieve fault tolerance in CLAIMS.
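
The lease-based detection and task restart described in the abstract can be illustrated with a minimal sketch. This is not the CLAIMS implementation: the class `LeaseTable`, the function `tasks_to_restart`, the node and task names, and the 3-second lease are all hypothetical, and time is passed in explicitly so the example stays deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class LeaseTable:
    """Hypothetical master-side lease table (illustrative, not CLAIMS code)."""
    lease_seconds: float
    expiry: dict = field(default_factory=dict)  # node id -> lease expiry time
    failed: set = field(default_factory=set)    # nodes marked as failed

    def renew(self, node, now):
        """Grant or extend a lease; a node marked failed must rejoin first."""
        if node in self.failed:
            return False
        self.expiry[node] = now + self.lease_seconds
        return True

    def detect_failures(self, now):
        """Fail-fast: mark every node whose lease has expired."""
        expired = sorted(n for n, t in self.expiry.items() if t <= now)
        for node in expired:
            del self.expiry[node]
            self.failed.add(node)
        return expired

    def rejoin(self, node, now):
        """Fail-back hook: readmit a node once its state has been restored."""
        self.failed.discard(node)
        self.expiry[node] = now + self.lease_seconds


def tasks_to_restart(task_nodes, failed_nodes):
    """Fail-over: select the tasks that run on at least one failed node."""
    failed = set(failed_nodes)
    return sorted(t for t, nodes in task_nodes.items() if failed & set(nodes))


# Usage: node-b stops renewing, so its lease (granted at t=0) lapses at t=3.
table = LeaseTable(lease_seconds=3.0)
table.renew("node-a", now=0.0)
table.renew("node-b", now=0.0)
table.renew("node-a", now=2.0)
failed = table.detect_failures(now=4.0)
restart = tasks_to_restart({"q1": ["node-a"], "q2": ["node-a", "node-b"]}, failed)
```

In this sketch a node that misses its renewal window is marked once and cannot silently resume; it has to go through `rejoin`, which stands in for the state-recovery step. How CLAIMS actually restores in-memory state is not specified here.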

