Materialization is an indispensable operation in column-store database queries, and materialization strategies and techniques play a crucial role during query execution. Designing a materialization strategy tailored to column-store databases is therefore particularly important. Early materialization produces tuples that contain irrelevant attributes, whereas late materialization may fail to improve performance for queries with high selectivity, and it accesses some columns multiple times. To address these shortcomings, this paper proposes a strategy distinct from both: the Smart materialization strategy. We introduce a structure called a projection into the logical query plan. A projection is generated from the attributes that the user selects as required by queries, and it amounts to a physical partitioning of the full table by column; at the start of a query it reduces the amount of data loaded directly into memory and avoids extra overhead. While building the logical query plan, the Smart materialization strategy uses projections as the basis of the scan operation to partition data by column, predicts the columns needed by the next query from the correlation of column accesses within a set of statements, and adds those columns to the most suitable projection for materialization. We verify the effectiveness of the strategy using the TPC-H data set on CLAIMS, a distributed in-memory database.
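The prediction step described above can be illustrated with a minimal sketch. It is not the CLAIMS implementation: it assumes that "correlation of column accesses" is approximated by counting how often two columns appear in the same statement, and the function names, the sample workload, and the projection names `p_price` and `p_order` are all hypothetical. Only the column names are taken from the TPC-H `lineitem` table.

```python
from collections import Counter
from itertools import combinations


def co_access_counts(workload):
    """Count how often each pair of columns appears in the same statement."""
    counts = Counter()
    for columns in workload:
        for a, b in combinations(sorted(set(columns)), 2):
            counts[(a, b)] += 1
    return counts


def affinity(counts, a, b):
    """Co-access count of two columns, independent of argument order."""
    return counts[tuple(sorted((a, b)))]


def predict_next_columns(counts, current_columns, known_columns, top_k=1):
    """Rank columns outside the current query by co-access with its columns."""
    current = set(current_columns)
    scores = {
        c: sum(affinity(counts, c, q) for q in current)
        for c in known_columns
        if c not in current
    }
    ranked = sorted(scores, key=lambda c: (-scores[c], c))
    # Columns never co-accessed with the current query are not predicted.
    return [c for c in ranked[:top_k] if scores[c] > 0]


def best_projection(counts, column, projections):
    """Pick the projection whose columns are most co-accessed with `column`."""
    def score(name):
        return sum(affinity(counts, column, c) for c in projections[name])
    return max(sorted(projections), key=score)


# Hypothetical statement set: each entry lists the columns one query touches.
workload = [
    ["l_shipdate", "l_discount", "l_extendedprice"],
    ["l_shipdate", "l_discount", "l_quantity"],
    ["l_orderkey", "l_returnflag"],
    ["l_shipdate", "l_extendedprice", "l_discount"],
]
# Hypothetical existing projections (vertical partitions of lineitem).
projections = {
    "p_price": {"l_shipdate", "l_discount"},
    "p_order": {"l_orderkey", "l_returnflag"},
}

counts = co_access_counts(workload)
all_columns = sorted({c for q in workload for c in q})
predicted = predict_next_columns(counts, ["l_shipdate", "l_discount"], all_columns)
target = best_projection(counts, predicted[0], projections)
```

In this toy workload the column predicted for a query over `l_shipdate` and `l_discount` is `l_extendedprice`, and it is assigned to `p_price`, the projection holding the columns it is most often read with. A real system would presumably also weigh column width, selectivity, and memory budget, which this sketch ignores.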