Abstract
Data cubes are used for analyzing large data sets, usually
contained in data warehouses. The most popular data cube tools use
graphical user interfaces (GUIs) for data analysis. Traditionally,
this was necessary since data analysts were not expected to be technical
people. However, in the subsequent decades the data landscape changed
dramatically, requiring companies to employ large teams of highly
technical data scientists to manage and use the ever-increasing amount
of data. These data scientists generally use tools like Python, interactive
notebooks, and pandas, while modern data cube tools are still GUI-based.
To bridge this gap, this paper proposes a Python-based data cube tool
called pyCube. pyCube can semi-automatically create data cubes
for data stored in an RDBMS and manages the data cube metadata.
pyCube’s programmatic interface enables data scientists to query data
cubes by specifying the metadata of the desired result. pyCube is
experimentally evaluated on the Star Schema Benchmark (SSB). The results
show that pyCube vastly outperforms different implementations of SSB
queries in pandas in both runtime and memory usage, while pyCube queries
are easier to read and write.
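
For context on the pandas side of that comparison, here is a minimal sketch of what a hand-written pandas version of an SSB-style query (modelled on SSB Q1.1) might look like. It is not taken from the paper: the table rows are made-up toy data, the function name `q1_1_pandas` is hypothetical, and pyCube's own interface is deliberately not shown because the abstract does not specify it.

```python
import pandas as pd

# Toy stand-ins for two SSB tables. The rows are invented for illustration
# and are far smaller than any real SSB scale factor.
lineorder = pd.DataFrame({
    "lo_orderdate": [19930101, 19930101, 19940101, 19930102],
    "lo_quantity": [10, 30, 5, 20],
    "lo_discount": [2, 2, 3, 5],
    "lo_extendedprice": [100, 200, 300, 400],
})
date = pd.DataFrame({
    "d_datekey": [19930101, 19930102, 19940101],
    "d_year": [1993, 1993, 1994],
})


def q1_1_pandas(lineorder: pd.DataFrame, date: pd.DataFrame) -> int:
    """Hypothetical pandas rendering of an SSB Q1.1-style query."""
    # Join the fact table to the date dimension on the date key.
    joined = lineorder.merge(date, left_on="lo_orderdate", right_on="d_datekey")
    # Apply the year, discount, and quantity predicates.
    mask = (
        (joined["d_year"] == 1993)
        & joined["lo_discount"].between(1, 3)  # inclusive on both ends
        & (joined["lo_quantity"] < 25)
    )
    selected = joined.loc[mask]
    # Aggregate revenue as the sum of extended price times discount.
    return int((selected["lo_extendedprice"] * selected["lo_discount"]).sum())


revenue = q1_1_pandas(lineorder, date)
print(revenue)
```

In this sketch the join is materialised in memory before any filtering, which is one plausible reason a pandas implementation could use more memory than a tool that pushes the query down to the RDBMS. The abstract itself does not state the cause of the difference.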
Original language | English |
---|---|
Title of host publication | Big Data Analytics and Knowledge Discovery |
Number of pages | 15 |
Publication date | 2024 |
Publication status | Published - 2024 |