Data Learning Systems

FeaDB: In-memory based multi-version online feature store

  • Ge GAO,
  • Huiqi HU
  • School of Data Science and Engineering, East China Normal University, Shanghai 200062, China

Received date: 2023-06-30

Accepted date: 2023-07-26

Online published: 2023-09-20

Funding

Natural Science Foundation of Shanghai (23ZR1418300)

Cite this article

Ge GAO, Huiqi HU. FeaDB: In-memory based multi-version online feature store[J]. Journal of East China Normal University (Natural Science), 2023, 2023(5): 65-76. DOI: 10.3969/j.issn.1000-5641.2023.05.006

Abstract

Feature management plays an important role in the artificial intelligence (AI) data pipeline. A feature store must serve valid feature versions during both model training and inference. To do so, it has to guarantee real-time feature updates and version management, so that it can work with upstream feature ingestion and supply data to the model serving system. In AI-powered online decision augmentation applications, the model serving system must respond to decision requests in real time to provide a better user experience, so real-time feature retrieval has to meet stricter latency requirements. Focusing on this challenge, we developed FeaDB, an in-memory multi-version online feature store. FeaDB models features as time series and provides feature versioning semantics to manage feature versions from production (ingestion) to consumption (serving). It uses append-only writes to sustain real-time feature ingestion performance and a version-based index to reduce read latency. To further reduce feature consumption latency, a version snapshot mechanism is proposed; experiments show that snapshot reads improve the performance of lookup and range lookup over feature set versions.
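
As a rough illustration of the design outlined in the abstract (time-series feature versions, append-only writes, a version index, and snapshot reads), the following Python sketch may help. It is not FeaDB's implementation or API: the class `MultiVersionFeatureStore`, its method names, and the integer version stamps are all hypothetical, and a sorted list with binary search merely stands in for whatever index structure FeaDB actually uses.

```python
import bisect
from collections import defaultdict


class MultiVersionFeatureStore:
    """Toy in-memory store (hypothetical): one append-only version log per key."""

    def __init__(self):
        # (entity_id, feature) -> version stamps, kept in increasing order
        self._versions = defaultdict(list)
        # (entity_id, feature) -> values, parallel to self._versions
        self._values = defaultdict(list)
        # snapshot version -> {(entity_id, feature): value}
        self._snapshots = {}

    def append(self, entity_id, feature, version, value):
        """Append-only write: a new version never overwrites an older one."""
        key = (entity_id, feature)
        versions = self._versions[key]
        if versions and version <= versions[-1]:
            raise ValueError("versions must be strictly increasing per key")
        versions.append(version)
        self._values[key].append(value)

    def read(self, entity_id, feature, version):
        """Return the newest value whose version is <= `version`, else None."""
        key = (entity_id, feature)
        versions = self._versions.get(key, [])
        idx = bisect.bisect_right(versions, version) - 1
        return self._values[key][idx] if idx >= 0 else None

    def read_range(self, entity_id, feature, start, end):
        """Return (version, value) pairs with start <= version <= end."""
        key = (entity_id, feature)
        versions = self._versions.get(key, [])
        values = self._values.get(key, [])
        lo = bisect.bisect_left(versions, start)
        hi = bisect.bisect_right(versions, end)
        return list(zip(versions[lo:hi], values[lo:hi]))

    def create_snapshot(self, version):
        """Materialize the value of every key as of `version`."""
        snap = {}
        for (entity_id, feature) in self._versions:
            value = self.read(entity_id, feature, version)
            if value is not None:
                snap[(entity_id, feature)] = value
        self._snapshots[version] = snap

    def read_snapshot(self, version, entity_id, features):
        """Read a feature set from a snapshot without per-key index searches."""
        snap = self._snapshots[version]
        return {f: snap.get((entity_id, f)) for f in features}


# Usage with made-up entities, features, and version stamps.
store = MultiVersionFeatureStore()
store.append("user_1", "clicks_7d", 100, 3)
store.append("user_1", "clicks_7d", 200, 5)
store.append("user_1", "avg_basket", 150, 42.0)
store.create_snapshot(180)
print(store.read("user_1", "clicks_7d", 180))
print(store.read_snapshot(180, "user_1", ["clicks_7d", "avg_basket"]))
```

In this sketch a snapshot trades memory for latency: a feature-set read at a snapshotted version becomes a handful of dictionary lookups instead of one index search per feature, which is one plausible reading of why snapshot reads could speed up feature set retrieval.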
