Journal of East China Normal University(Natural Science)
Separate management strategies for Part metadata under the storage-computing separation architecture
Received date: 2023-06-30
Accepted date: 2023-07-24
Online published: 2023-09-20
To address the deficiencies of ClickHouse, including underutilized hardware resources, limited flexibility, and slow node startup, this paper proposes metadata management strategies for the storage-computing separation architecture, focusing on Part metadata, which describe the stored data and are the most crucial component of the metadata. To manage data on remote shared storage effectively, all Part metadata files are collected and merged. After key-value mapping, serialization, and deserialization, the merged metadata are stored in a distributed key-value database. A synchronization strategy is also designed to keep the data on remote shared storage consistent with the metadata in the distributed key-value database. Based on these strategies, a management system for Part metadata was developed, which addresses the slow node startup issue in ClickHouse and supports efficient dynamic scaling of nodes.
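The pipeline described above (merge the metadata files of a Part, map them to a key, serialize, store, and later synchronize with shared storage) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `InMemoryKV` class stands in for a distributed key-value database, and the key layout `part/<table>/<part_name>`, the JSON encoding, and all function names are assumptions made for the example.

```python
import json
from typing import Dict, List, Optional


class InMemoryKV:
    """Hypothetical stand-in for a distributed key-value database client."""

    def __init__(self) -> None:
        self._data: Dict[bytes, bytes] = {}

    def put(self, key: bytes, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: bytes) -> Optional[bytes]:
        return self._data.get(key)

    def delete(self, key: bytes) -> None:
        self._data.pop(key, None)

    def keys_with_prefix(self, prefix: bytes) -> List[bytes]:
        return sorted(k for k in self._data if k.startswith(prefix))


def part_key(table: str, part_name: str) -> bytes:
    # Assumed key layout: one key per Part, grouped by table.
    return f"part/{table}/{part_name}".encode()


def serialize_part(files: Dict[str, bytes]) -> bytes:
    # Merge all metadata files of one Part into a single value.
    # File contents are hex-encoded so arbitrary bytes survive JSON.
    merged = {name: content.hex() for name, content in sorted(files.items())}
    return json.dumps(merged).encode()


def deserialize_part(blob: bytes) -> Dict[str, bytes]:
    # Inverse of serialize_part: recover the per-file contents.
    merged = json.loads(blob.decode())
    return {name: bytes.fromhex(hexed) for name, hexed in merged.items()}


def sync_table(kv: InMemoryKV, table: str,
               parts_on_storage: Dict[str, Dict[str, bytes]]) -> None:
    """Make the key-value entries of a table match the Parts on shared storage."""
    prefix = f"part/{table}/".encode()
    wanted = {part_key(table, name): serialize_part(files)
              for name, files in parts_on_storage.items()}
    # Drop entries whose Part no longer exists on shared storage.
    for key in kv.keys_with_prefix(prefix):
        if key not in wanted:
            kv.delete(key)
    # Insert new Parts and refresh entries whose metadata changed.
    for key, value in wanted.items():
        if kv.get(key) != value:
            kv.put(key, value)
```

Under these assumptions, a starting node could load the metadata of a table with a single prefix scan over the key-value store instead of reading many small metadata files from remote storage, which is the kind of saving the abstract attributes to the approach.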
Danqi LIU, Peng CAI. Separate management strategies for Part metadata under the storage-computing separation architecture[J]. Journal of East China Normal University(Natural Science), 2023, 2023(5): 40-50. DOI: 10.3969/j.issn.1000-5641.2023.05.004