OLAP over Probabilistic Data Cubes II: Parallel Materialization and Extended Aggregates

X. Xie; K. Zou; X. Hao; T. B. Pedersen; Peiquan Jin; W. Yang

doi:10.1109/TKDE.2019.2913420

Publication: Contribution to journal › Journal article › Research › peer-reviewed

6 Citations (Scopus)

156 Downloads (Pure)

Abstract

On-Line Analytical Processing (OLAP) enables powerful analytics by quickly computing aggregate values of numerical measures over multiple hierarchical dimensions for massive datasets. However, many types of source data, e.g., from GPS, sensors, and other measurement devices, are intrinsically inaccurate (imprecise and/or uncertain) and thus OLAP cannot be readily applied. In this paper, we address the resulting data veracity problem in OLAP by proposing the concept of probabilistic data cubes. Such a cube is comprised of a set of probabilistic cuboids which summarize the aggregated values in the form of probability mass functions (pmfs for short) and thus offer insights into the underlying data quality and enable confidence-aware query evaluation and analysis. However, the probabilistic nature of data poses computational challenges, since a probabilistic database can have an exponential number of possible worlds under the possible world semantics. Even worse, it is hard to share computations among different cuboids, as aggregation functions that are distributive for traditional data cubes, e.g., SUM, become holistic in probabilistic settings. In this paper, we propose a complete set of techniques for probabilistic data cubes, from cuboid aggregation, over cube materialization, to query evaluation. We study two types of aggregation: convolution and sketch-based, both of which have polynomial time complexity for aggregation and jointly enable efficient query processing. Also, our proposal is versatile in terms of: 1) its capability of supporting common aggregation functions, i.e., SUM, COUNT, MAX, and AVG; 2) its adaptivity to different materialization strategies, e.g., full versus partial materialization, with support of our devised cost models and parallelization framework; 3) its coverage of common OLAP operations, i.e., probabilistic slicing and dicing queries. Extensive experiments over real and synthetic datasets show that our techniques are effective and scalable.
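
The abstract names convolution as one aggregation type but does not spell out the computation. As a rough illustration only, the sketch below assumes each uncertain measure is an independent discrete random variable given as a pmf over integer values. Under that assumption the pmf of SUM is the convolution of the individual pmfs, which avoids enumerating every possible world. The function names and the sample readings are hypothetical and are not taken from the paper; this is not the authors' algorithm.

```python
from collections import defaultdict
from itertools import product


def convolve_pmfs(p, q):
    """Return the pmf of X + Y for independent X ~ p and Y ~ q.

    Each pmf is a dict mapping an integer value to its probability.
    """
    out = defaultdict(float)
    for x, px in p.items():
        for y, qy in q.items():
            out[x + y] += px * qy
    return dict(out)


def sum_pmf(pmfs):
    """Fold convolution over a list of pmfs to get the pmf of their SUM."""
    result = {0: 1.0}  # pmf of the empty sum
    for pmf in pmfs:
        result = convolve_pmfs(result, pmf)
    return result


def sum_pmf_possible_worlds(pmfs):
    """Brute-force reference: enumerate every possible world explicitly."""
    out = defaultdict(float)
    for world in product(*(pmf.items() for pmf in pmfs)):
        total = sum(value for value, _ in world)
        prob = 1.0
        for _, p in world:
            prob *= p
        out[total] += prob
    return dict(out)


# Hypothetical uncertain measures, e.g. three noisy sensor readings.
readings = [
    {1: 0.5, 2: 0.5},
    {0: 0.2, 3: 0.8},
    {2: 1.0},
]

print(sum_pmf(readings))
```

The brute-force function visits one world per combination of outcomes, so its cost grows exponentially with the number of readings, whereas the folded convolution is bounded by the size of the value domain at each step. Both return the same distribution on this small input.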

Original language	English
Article number	8700285
Journal	IEEE Transactions on Knowledge and Data Engineering
Volume	32
Issue number	10
Pages (from-to)	1966-1981
Number of pages	16
ISSN	1041-4347
DOI	https://doi.org/10.1109/TKDE.2019.2913420
Status	Published - 1 Oct 2020

Keywords

Probabilistic logic
Aggregates
Sensors
Temperature measurement
Query processing
Convolution
Time measurement
Probabilistic Databases
OLAP
Data Warehousing

Access to document

10.1109/TKDE.2019.2913420

Green Open Access manuscript: Accepted manuscript, 1.24 MB

Citation formats

@article{bf38e2fbe5c04a83b7e84c4bdd642e96,
title = "OLAP over Probabilistic Data Cubes II: Parallel Materialization and Extended Aggregates",
abstract = "On-Line Analytical Processing (OLAP) enables powerful analytics by quickly computing aggregate values of numerical measures over multiple hierarchical dimensions for massive datasets. However, many types of source data, e.g., from GPS, sensors, and other measurement devices, are intrinsically inaccurate (imprecise and/or uncertain) and thus OLAP cannot be readily applied. In this paper, we address the resulting data veracity problem in OLAP by proposing the concept of probabilistic data cubes. Such a cube is comprised of a set of probabilistic cuboids which summarize the aggregated values in the form of probability mass functions (pmfs for short) and thus offer insights into the underlying data quality and enable confidence-aware query evaluation and analysis. However, the probabilistic nature of data poses computational challenges, since a probabilistic database can have an exponential number of possible worlds under the possible world semantics. Even worse, it is hard to share computations among different cuboids, as aggregation functions that are distributive for traditional data cubes, e.g., SUM, become holistic in probabilistic settings. In this paper, we propose a complete set of techniques for probabilistic data cubes, from cuboid aggregation, over cube materialization, to query evaluation. We study two types of aggregation: convolution and sketch-based, both of which have polynomial time complexity for aggregation and jointly enable efficient query processing. Also, our proposal is versatile in terms of: 1) its capability of supporting common aggregation functions, i.e., SUM, COUNT, MAX, and AVG; 2) its adaptivity to different materialization strategies, e.g., full versus partial materialization, with support of our devised cost models and parallelization framework; 3) its coverage of common OLAP operations, i.e., probabilistic slicing and dicing queries. Extensive experiments over real and synthetic datasets show that our techniques are effective and scalable.",
keywords = "Probabilistic logic, Aggregates, Sensors, Temperature measurement, Query processing, Convolution, Time measurement, Probabilistic Databases, OLAP, Data Warehousing",
author = "X. Xie and K. Zou and X. Hao and Pedersen, {T. B.} and Peiquan Jin and W. Yang",
year = "2020",
month = oct,
day = "1",
doi = "10.1109/TKDE.2019.2913420",
language = "English",
volume = "32",
pages = "1966--1981",
journal = "IEEE Transactions on Knowledge and Data Engineering",
issn = "1041-4347",
publisher = "IEEE",
number = "10",
}

TY  - JOUR
T1  - OLAP over Probabilistic Data Cubes II
T2  - Parallel Materialization and Extended Aggregates
AU  - Xie, X.
AU  - Zou, K.
AU  - Hao, X.
AU  - Pedersen, T. B.
AU  - Jin, Peiquan
AU  - Yang, W.
PY  - 2020/10/1
Y1  - 2020/10/1
N2  - On-Line Analytical Processing (OLAP) enables powerful analytics by quickly computing aggregate values of numerical measures over multiple hierarchical dimensions for massive datasets. However, many types of source data, e.g., from GPS, sensors, and other measurement devices, are intrinsically inaccurate (imprecise and/or uncertain) and thus OLAP cannot be readily applied. In this paper, we address the resulting data veracity problem in OLAP by proposing the concept of probabilistic data cubes. Such a cube is comprised of a set of probabilistic cuboids which summarize the aggregated values in the form of probability mass functions (pmfs for short) and thus offer insights into the underlying data quality and enable confidence-aware query evaluation and analysis. However, the probabilistic nature of data poses computational challenges, since a probabilistic database can have an exponential number of possible worlds under the possible world semantics. Even worse, it is hard to share computations among different cuboids, as aggregation functions that are distributive for traditional data cubes, e.g., SUM, become holistic in probabilistic settings. In this paper, we propose a complete set of techniques for probabilistic data cubes, from cuboid aggregation, over cube materialization, to query evaluation. We study two types of aggregation: convolution and sketch-based, both of which have polynomial time complexity for aggregation and jointly enable efficient query processing. Also, our proposal is versatile in terms of: 1) its capability of supporting common aggregation functions, i.e., SUM, COUNT, MAX, and AVG; 2) its adaptivity to different materialization strategies, e.g., full versus partial materialization, with support of our devised cost models and parallelization framework; 3) its coverage of common OLAP operations, i.e., probabilistic slicing and dicing queries. Extensive experiments over real and synthetic datasets show that our techniques are effective and scalable.
AB  - On-Line Analytical Processing (OLAP) enables powerful analytics by quickly computing aggregate values of numerical measures over multiple hierarchical dimensions for massive datasets. However, many types of source data, e.g., from GPS, sensors, and other measurement devices, are intrinsically inaccurate (imprecise and/or uncertain) and thus OLAP cannot be readily applied. In this paper, we address the resulting data veracity problem in OLAP by proposing the concept of probabilistic data cubes. Such a cube is comprised of a set of probabilistic cuboids which summarize the aggregated values in the form of probability mass functions (pmfs for short) and thus offer insights into the underlying data quality and enable confidence-aware query evaluation and analysis. However, the probabilistic nature of data poses computational challenges, since a probabilistic database can have an exponential number of possible worlds under the possible world semantics. Even worse, it is hard to share computations among different cuboids, as aggregation functions that are distributive for traditional data cubes, e.g., SUM, become holistic in probabilistic settings. In this paper, we propose a complete set of techniques for probabilistic data cubes, from cuboid aggregation, over cube materialization, to query evaluation. We study two types of aggregation: convolution and sketch-based, both of which have polynomial time complexity for aggregation and jointly enable efficient query processing. Also, our proposal is versatile in terms of: 1) its capability of supporting common aggregation functions, i.e., SUM, COUNT, MAX, and AVG; 2) its adaptivity to different materialization strategies, e.g., full versus partial materialization, with support of our devised cost models and parallelization framework; 3) its coverage of common OLAP operations, i.e., probabilistic slicing and dicing queries. Extensive experiments over real and synthetic datasets show that our techniques are effective and scalable.
KW  - Probabilistic logic
KW  - Aggregates
KW  - Sensors
KW  - Temperature measurement
KW  - Query processing
KW  - Convolution
KW  - Time measurement
KW  - Probabilistic Databases
KW  - OLAP
KW  - Data Warehousing
UR  - http://www.scopus.com/inward/record.url?scp=85091254731&partnerID=8YFLogxK
U2  - 10.1109/TKDE.2019.2913420
DO  - 10.1109/TKDE.2019.2913420
M3  - Journal article
SN  - 1041-4347
VL  - 32
SP  - 1966
EP  - 1981
JO  - IEEE Transactions on Knowledge and Data Engineering
JF  - IEEE Transactions on Knowledge and Data Engineering
IS  - 10
M1  - 8700285
ER  -
