Computer Science and Technology

Fault tolerance recovery techniques in large distributed parallel computing system

  • ZHANG Xin-Zhou,
  • ZHOU Min-Qi
  • Software Engineering Institute, East China Normal University, Shanghai 200062, China
ZHANG Xin-Zhou, male, master's student; research interest: in-memory database systems. Email: 370490819@qq.com

Online published: 2014-11-27

Funding

National Natural Science Foundation of China (61332006)

Abstract

Supercomputing systems today often come in the form of large numbers of commodity systems linked together into a computing cluster. Like any distributed system, these clusters can have large numbers of independent hardware components cooperating on a computation. Unfortunately, any of these components can fail at any time, resulting in potentially erroneous output. To improve the robustness of supercomputing applications in the presence of failures, many techniques have been developed to provide resilience to these kinds of system faults. This survey provides an overview of these fault tolerance techniques as a basis for further research on fault tolerance suited to current systems.

Keywords: fault tolerance; parallel computing; cluster

Cite this article

ZHANG Xin-Zhou, ZHOU Min-Qi. Fault tolerance recovery techniques in large distributed parallel computing system[J]. Journal of East China Normal University (Natural Science), 2014, 2014(5): 207-215. DOI: 10.3969/j.issn.10005641.2014.05.018