Journal of East China Normal University (Natural Science)
Hybrid granular buffer management scheme for storage and computing separation architecture
Received date: 2023-07-01
Online published: 2023-09-20
Storage-compute separation architectures have emerged as a way to improve the performance and efficiency of large-scale data processing. However, this approach has notable performance bottlenecks, primarily the low access efficiency of object storage and significant network overhead. Object storage also stores small files inefficiently, and ClickHouse, a MergeTree-based database, generates a large number of small files when storing data. To address these challenges, HG-Buffer (hybrid granularity buffer) is introduced, an SSD (solid-state drive)-based cache management scheme that optimizes storage-compute separation between ClickHouse and S3 while also mitigating the small-file problem in object storage. HG-Buffer aims to reduce network transmission overhead and improve system access efficiency. It places an SSD caching layer between the compute and storage layers and organizes the SSD buffer at two granularities: an object buffer and a block buffer. The object buffer granularity matches the granularity of data in object storage, while the block buffer granularity matches the granularity of data accessed by the system, so each block is a subset of an object. Using statistics on data hotness, HG-Buffer adaptively selects where to store data, improving SSD space utilization and system performance. Experimental evaluations on ClickHouse and S3 demonstrate the effectiveness and robustness of HG-Buffer.
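The two-granularity idea can be illustrated with a minimal Python sketch. It is not the HG-Buffer implementation: the abstract does not specify the placement or eviction policy, so the sketch assumes LRU eviction, capacities counted in slots rather than bytes, and a simple hotness rule that promotes an object to the object buffer once a given fraction of its blocks has been accessed. The names `HybridBuffer`, `fetch_block`, `hot_fraction`, and `fake_s3_read` are hypothetical.

```python
from collections import OrderedDict, defaultdict


class HybridBuffer:
    """Toy two-granularity cache: whole objects vs. individual blocks."""

    def __init__(self, fetch_block, blocks_per_object, object_slots,
                 block_slots, hot_fraction=0.5):
        self.fetch_block = fetch_block      # hypothetical remote read: (key, idx) -> bytes
        self.blocks_per_object = blocks_per_object
        self.object_slots = object_slots    # object buffer capacity, in objects
        self.block_slots = block_slots      # block buffer capacity, in blocks
        self.hot_fraction = hot_fraction    # assumed promotion threshold
        self.objects = OrderedDict()        # key -> list of blocks, in LRU order
        self.blocks = OrderedDict()         # (key, idx) -> block, in LRU order
        self.touched = defaultdict(set)     # key -> distinct block indexes accessed
        self.remote_reads = 0               # number of simulated object-store reads

    def _remote(self, key, idx):
        self.remote_reads += 1
        return self.fetch_block(key, idx)

    def read(self, key, idx):
        self.touched[key].add(idx)          # record hotness statistics
        if key in self.objects:             # hit in the object buffer
            self.objects.move_to_end(key)
            return self.objects[key][idx]
        if (key, idx) in self.blocks:       # hit in the block buffer
            self.blocks.move_to_end((key, idx))
        else:                               # miss: fetch only the requested block
            self.blocks[(key, idx)] = self._remote(key, idx)
        data = self.blocks[(key, idx)]
        if len(self.touched[key]) / self.blocks_per_object >= self.hot_fraction:
            self._promote(key)              # object is hot: cache it whole
        elif len(self.blocks) > self.block_slots:
            self.blocks.popitem(last=False)  # evict the least recently used block
        return data

    def _promote(self, key):
        # Reuse blocks already cached on SSD and fetch only the missing ones.
        whole = []
        for i in range(self.blocks_per_object):
            cached = self.blocks.pop((key, i), None)
            whole.append(cached if cached is not None else self._remote(key, i))
        self.objects[key] = whole
        if len(self.objects) > self.object_slots:
            evicted, _ = self.objects.popitem(last=False)
            self.touched.pop(evicted, None)  # forget statistics of evicted objects


def fake_s3_read(key, idx):
    """Stand-in for a ranged read against object storage."""
    return f"{key}:{idx}".encode()


buf = HybridBuffer(fake_s3_read, blocks_per_object=4, object_slots=1, block_slots=2)
buf.read("part_1", 0)    # cold object: only this block is cached
buf.read("part_1", 1)    # half the blocks touched: the whole object is promoted
buf.read("part_1", 3)    # served from the object buffer, no remote read
print(buf.remote_reads)  # 4 under these settings
```

In this sketch, cold objects occupy only the blocks that were actually read, while hot objects are kept whole so later reads of any block avoid the network. This mirrors the trade-off between SSD space utilization and access efficiency that the abstract describes.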
Wenjuan MEI, Peng CAI. Hybrid granular buffer management scheme for storage and computing separation architecture[J]. Journal of East China Normal University (Natural Science), 2023, 2023(5): 26-39. DOI: 10.3969/j.issn.1000-5641.2023.05.003