ModelarDB: Modular Model-based Time Series Management with Spark and Cassandra

Søren Kejser Jensen; Torben Bach Pedersen; Christian Thomsen

doi:10.14778/3236187.3236215


Publication: Contribution to journal › Journal article › Research › peer review

22 Citations (Scopus)

Abstract

Industrial systems, e.g., wind turbines, generate big amounts of data from reliable sensors with high velocity. As it is unfeasible to store and query such big amounts of data, only simple aggregates are currently stored. However, aggregates remove fluctuations and outliers that can reveal underlying problems and limit the knowledge to be gained from historical data. As a remedy, we present the distributed Time Series Management System (TSMS) ModelarDB that uses models to store sensor data. We thus propose an online, adaptive multi-model compression algorithm that maintains data values within a user-defined error bound (possibly zero). We also propose (i) a database schema to store time series as models, (ii) methods to push-down predicates to a key-value store utilizing this schema, (iii) optimized methods to execute aggregate queries on models, (iv) a method to optimize execution of projections through static code-generation, and (v) dynamic extensibility that allows new models to be used without recompiling the TSMS. Further, we present a general modular distributed TSMS architecture and its implementation, ModelarDB, as a portable library, using Apache Spark for query processing and Apache Cassandra for storage. An experimental evaluation shows that, unlike current systems, ModelarDB hits a sweet spot and offers fast ingestion, good compression, and fast, scalable online aggregate query processing at the same time. This is achieved by dynamically adapting to data sets using multiple models. The system degrades gracefully as more outliers occur and the actual errors are much lower than the bounds.
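The idea of storing models instead of raw values, and of answering aggregates from those models, can be illustrated with a minimal sketch. It is not the multi-model algorithm from the paper: it uses a single constant model per segment and assumes an absolute error bound, and the names `Segment`, `compress`, `decompress`, and `sum_on_models` are hypothetical and chosen only for this illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Segment:
    start: int    # index of the first data point the segment covers
    length: int   # number of consecutive data points covered
    value: float  # constant that stands in for every covered point


def compress(values: Iterable[float], error_bound: float) -> List[Segment]:
    """Greedy online fitting of constant segments (illustrative only).

    A segment grows while the midpoint of its min and max stays within
    error_bound (absolute, an assumption) of every point it covers.
    """
    segments: List[Segment] = []
    start, count = 0, 0
    lo = hi = 0.0
    for i, v in enumerate(values):
        if count == 0:
            start, lo, hi, count = i, v, v, 1
            continue
        new_lo, new_hi = min(lo, v), max(hi, v)
        if (new_hi - new_lo) / 2 <= error_bound:
            # The point fits: widen the range and extend the segment.
            lo, hi, count = new_lo, new_hi, count + 1
        else:
            # The point does not fit: emit the segment and start a new one.
            segments.append(Segment(start, count, (lo + hi) / 2))
            start, lo, hi, count = i, v, v, 1
    if count:
        segments.append(Segment(start, count, (lo + hi) / 2))
    return segments


def decompress(segments: List[Segment]) -> List[float]:
    """Reconstruct an approximate series from the stored segments."""
    out: List[float] = []
    for seg in segments:
        out.extend([seg.value] * seg.length)
    return out


def sum_on_models(segments: List[Segment]) -> float:
    """Compute SUM from the segments without reconstructing data points."""
    return sum(seg.value * seg.length for seg in segments)


if __name__ == "__main__":
    readings = [10.0, 10.5, 11.0, 20.0, 20.0, 19.0, 5.0]
    segs = compress(readings, error_bound=0.5)
    print(segs)
    print(sum_on_models(segs), sum(readings))
```

With an error bound of zero, the sketch only merges runs of identical values, so the reconstruction is lossless. A system along the lines described in the abstract would additionally try several model types per segment and keep the one that compresses best; that selection step is omitted here.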

Original language	English
Journal	Proceedings of the VLDB Endowment
Volume	11
Issue number	11
Pages (from-to)	1688-1701
Number of pages	14
ISSN	2150-8097
DOI	https://doi.org/10.14778/3236187.3236215
Status	Published - 1 Jul 2018

Access to document

10.14778/3236187.3236215

http://www.vldb.org/pvldb/vol11/p1688-jensen.pdf


Other files and links

http://www.scopus.com/inward/record.url?scp=85058895804&partnerID=8YFLogxK

22 Citations
3 Conference article in proceedings
2 Contribution to book/anthology

ModelarDB: Integrated Model-Based Management of Time Series from Edge to Cloud
Jensen, S. K., Thomsen, C. & Pedersen, T. B., 9 Feb 2023, Transactions on Large-Scale Data- and Knowledge-Centered Systems LIII. Hameurlain, A. & Tjoa, A. M. (eds.). Springer, pp. 1-33, 33 pp. (Transactions on Large-Scale Data- and Knowledge-Centered Systems). (Lecture Notes in Computer Science, Vol. 13840).
Publication: Contribution to book/anthology/report/conference proceeding › Contribution to book/anthology › Research › peer review

Open access
Machine Learning Platform for Extreme Scale Computing on Compressed IoT Data
Tirupathi, S., Salwala, D., Zizzo, G., Rawat, A., Purcell, M., Jensen, S. K., Thomsen, C., Ho, N., Cuza, C. E. M., Brusokas, J., Pedersen, T. B., Alexiou, G., Giannopoulos, G., Gidarakos, P., Kalimeris, A., Maroulis, S., Papastefanatos, G., Psarros, I., Stamatopoulos, V. & Terrovitis, M., 20 Dec 2022, 2022 IEEE International Conference on Big Data (Big Data). Tsumoto, S., Ohsawa, Y., Chen, L., Van den Poel, D., Hu, X., Motomura, Y., Takagi, T., Wu, L., Xie, Y., Abe, A. & Raghavan, V. (eds.). IEEE Communications Society, pp. 3179-3185, 7 pp. 10020540
Publication: Contribution to book/anthology/report/conference proceeding › Conference article in proceedings › Research › peer review
1 Citation (Scopus)
Time Series Management Systems: A 2022 Survey
Jensen, S. K., Pedersen, T. B. & Thomsen, C., 4 Dec 2022, (Accepted/In press) Data Series Management and Analytics. Palpanas, T. & Zoumpatianos, K. (eds.). Association for Computing Machinery, 81 pp.
Publication: Contribution to book/anthology/report/conference proceeding › Contribution to book/anthology › Research › peer review

Open access

Citation formats

@article{e0e45ae60fa543389648aa443314eb8b,

title = "ModelarDB: Modular Model-based Time Series Management with Spark and Cassandra",

abstract = "Industrial systems, e.g., wind turbines, generate big amounts of data from reliable sensors with high velocity. As it is unfeasible to store and query such big amounts of data, only simple aggregates are currently stored. However, aggregates remove fluctuations and outliers that can reveal underlying problems and limit the knowledge to be gained from historical data. As a remedy, we present the distributed Time Series Management System (TSMS) ModelarDB that uses models to store sensor data. We thus propose an online, adaptive multi-model compression algorithm that maintains data values within a user-defined error bound (possibly zero). We also propose (i) a database schema to store time series as models, (ii) methods to push-down predicates to a key-value store utilizing this schema, (iii) optimized methods to execute aggregate queries on models, (iv) a method to optimize execution of projections through static code-generation, and (v) dynamic extensibility that allows new models to be used without recompiling the TSMS. Further, we present a general modular distributed TSMS architecture and its implementation, ModelarDB, as a portable library, using Apache Spark for query processing and Apache Cassandra for storage. An experimental evaluation shows that, unlike current systems, ModelarDB hits a sweet spot and offers fast ingestion, good compression, and fast, scalable online aggregate query processing at the same time. This is achieved by dynamically adapting to data sets using multiple models. The system degrades gracefully as more outliers occur and the actual errors are much lower than the bounds.",

author = "Jensen, {S{\o}ren Kejser} and Pedersen, {Torben Bach} and Christian Thomsen",

year = "2018",

month = jul,

day = "1",

doi = "10.14778/3236187.3236215",

language = "English",

volume = "11",

pages = "1688--1701",

journal = "Proceedings of the VLDB Endowment",

issn = "2150-8097",

publisher = "VLDB Endowment",

number = "11",

}

TY - JOUR

T1 - ModelarDB

T2 - Modular Model-based Time Series Management with Spark and Cassandra

AU - Jensen, Søren Kejser

AU - Pedersen, Torben Bach

AU - Thomsen, Christian

PY - 2018/7/1

Y1 - 2018/7/1

N2 - Industrial systems, e.g., wind turbines, generate big amounts of data from reliable sensors with high velocity. As it is unfeasible to store and query such big amounts of data, only simple aggregates are currently stored. However, aggregates remove fluctuations and outliers that can reveal underlying problems and limit the knowledge to be gained from historical data. As a remedy, we present the distributed Time Series Management System (TSMS) ModelarDB that uses models to store sensor data. We thus propose an online, adaptive multi-model compression algorithm that maintains data values within a user-defined error bound (possibly zero). We also propose (i) a database schema to store time series as models, (ii) methods to push-down predicates to a key-value store utilizing this schema, (iii) optimized methods to execute aggregate queries on models, (iv) a method to optimize execution of projections through static code-generation, and (v) dynamic extensibility that allows new models to be used without recompiling the TSMS. Further, we present a general modular distributed TSMS architecture and its implementation, ModelarDB, as a portable library, using Apache Spark for query processing and Apache Cassandra for storage. An experimental evaluation shows that, unlike current systems, ModelarDB hits a sweet spot and offers fast ingestion, good compression, and fast, scalable online aggregate query processing at the same time. This is achieved by dynamically adapting to data sets using multiple models. The system degrades gracefully as more outliers occur and the actual errors are much lower than the bounds.

AB - Industrial systems, e.g., wind turbines, generate big amounts of data from reliable sensors with high velocity. As it is unfeasible to store and query such big amounts of data, only simple aggregates are currently stored. However, aggregates remove fluctuations and outliers that can reveal underlying problems and limit the knowledge to be gained from historical data. As a remedy, we present the distributed Time Series Management System (TSMS) ModelarDB that uses models to store sensor data. We thus propose an online, adaptive multi-model compression algorithm that maintains data values within a user-defined error bound (possibly zero). We also propose (i) a database schema to store time series as models, (ii) methods to push-down predicates to a key-value store utilizing this schema, (iii) optimized methods to execute aggregate queries on models, (iv) a method to optimize execution of projections through static code-generation, and (v) dynamic extensibility that allows new models to be used without recompiling the TSMS. Further, we present a general modular distributed TSMS architecture and its implementation, ModelarDB, as a portable library, using Apache Spark for query processing and Apache Cassandra for storage. An experimental evaluation shows that, unlike current systems, ModelarDB hits a sweet spot and offers fast ingestion, good compression, and fast, scalable online aggregate query processing at the same time. This is achieved by dynamically adapting to data sets using multiple models. The system degrades gracefully as more outliers occur and the actual errors are much lower than the bounds.

UR - http://www.scopus.com/inward/record.url?scp=85058895804&partnerID=8YFLogxK

U2 - 10.14778/3236187.3236215

DO - 10.14778/3236187.3236215

M3 - Journal article

SN - 2150-8097

VL - 11

SP - 1688

EP - 1701

JO - Proceedings of the VLDB Endowment

JF - Proceedings of the VLDB Endowment

IS - 11

ER -
