Journal of East China Normal University (Natural Science), 2023, Vol. 2023, Issue (5): 26-39. doi: 10.3969/j.issn.1000-5641.2023.05.003

• Database Systems •

Hybrid granular buffer management scheme for storage and computing separation architecture

Wenjuan MEI (梅文娟), Peng CAI (蔡鹏)*

  1. School of Data Science and Engineering, East China Normal University, Shanghai 200062, China
  • Received: 2023-07-01 Online: 2023-09-25 Published: 2023-09-15
  • Contact: Peng CAI E-mail: pcai@dase.ecnu.edu.cn
  • Supported by:
    National Natural Science Foundation of China (61972149, U22B2020)

Abstract:

Storage-compute separation has emerged as an architecture for improving the performance and efficiency of large-scale data processing. However, it suffers from notable performance bottlenecks, primarily the low access efficiency of object storage and significant network overhead. Object storage is also inefficient for small files, and ClickHouse, a MergeTree-based database, generates many small files when storing data. In the existing ClickHouse and S3 storage-compute separation scheme, the SSD (solid-state drive) cache uses a fixed file granularity, which does not match the granularity of data in memory and wastes cache space. To address these problems, this paper proposes HG-Buffer (hybrid granularity buffer), an SSD-based cache management scheme that optimizes the ClickHouse and S3 storage-compute separation scheme and mitigates the small-file problem of object storage. Its goal is to improve cache space utilization, reduce network transmission overhead, and thereby improve system access efficiency. HG-Buffer places an SSD as a caching layer between the compute and storage layers and organizes the SSD buffer at two granularities: an object buffer and a block buffer. The object buffer granularity is the granularity at which data is stored in object storage; the block buffer granularity is the granularity at which the system accesses data, so a block is a subset of an object. By collecting data hotness statistics, HG-Buffer adaptively chooses where to cache data, improving SSD space utilization and system performance. Experimental evaluations on ClickHouse and S3 demonstrate the effectiveness and robustness of HG-Buffer.

Key words: storage-computing separation, hybrid granular cache management, solid-state drive (SSD) cache
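
The two-granularity idea in the abstract can be illustrated with a toy model. The sketch below is not HG-Buffer itself: the class name, the LRU eviction, the `promote_fraction` threshold, and the rule "cache the whole object once enough of its blocks have been touched" are all assumptions made for illustration, since the abstract does not specify the actual hotness statistics or placement policy. Sizes are counted in blocks.

```python
from collections import OrderedDict, defaultdict


class HybridGranularityBuffer:
    """Toy SSD cache with an object-level and a block-level LRU region.

    The placement rule used here is an assumption for this sketch,
    not the policy described in the paper.
    """

    def __init__(self, object_capacity, block_capacity, promote_fraction=0.5):
        self.object_capacity = object_capacity  # blocks reserved for whole objects
        self.block_capacity = block_capacity    # blocks reserved for single blocks
        self.promote_fraction = promote_fraction
        self.objects = OrderedDict()            # object_id -> size in blocks (LRU order)
        self.blocks = OrderedDict()             # (object_id, block_no) -> True (LRU order)
        self.touched = defaultdict(set)         # object_id -> block numbers seen so far

    def access(self, object_id, block_no, object_size):
        """Return where the read was served from: 'object', 'block', or 'miss'."""
        key = (object_id, block_no)
        if object_id in self.objects:
            self.objects.move_to_end(object_id)
            return "object"

        served = "block" if key in self.blocks else "miss"
        self.touched[object_id].add(block_no)
        hot = len(self.touched[object_id]) / object_size >= self.promote_fraction

        if hot and object_size <= self.object_capacity:
            self._admit_object(object_id, object_size)
        elif served == "block":
            self.blocks.move_to_end(key)
        else:
            self._admit_block(key)
        return served

    def _admit_object(self, object_id, object_size):
        # Evict least recently used objects until the new one fits.
        while sum(self.objects.values()) + object_size > self.object_capacity:
            evicted_id, _ = self.objects.popitem(last=False)
            self.touched.pop(evicted_id, None)  # hotness must be re-earned
        self.objects[object_id] = object_size
        # Blocks of this object cached individually are now redundant.
        for key in [k for k in self.blocks if k[0] == object_id]:
            del self.blocks[key]

    def _admit_block(self, key):
        if self.block_capacity <= 0:
            return
        while len(self.blocks) >= self.block_capacity:
            self.blocks.popitem(last=False)
        self.blocks[key] = True


buf = HybridGranularityBuffer(object_capacity=4, block_capacity=2)
trace = [buf.access("a", b, object_size=4) for b in (0, 0, 1, 3)]
print(trace)  # ['miss', 'block', 'miss', 'object']
```

In this trace the first two reads of object `a` are handled at block granularity; once half of its blocks have been touched, the whole object moves to the object buffer and its block entry is dropped, so the SSD does not hold the same data twice.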

CLC Number: