CloudETL: Scalable Dimensional ETL for Hive

Xiufeng Liu; Christian Thomsen; Torben Bach Pedersen

doi:10.1145/2628194.2628249


Research output: Contribution to book/anthology/report/conference proceeding › Article in proceeding › Research › peer-review

23 Citations (Scopus)

Abstract

Extract-Transform-Load (ETL) programs process data into data warehouses (DWs). Rapidly growing data volumes demand systems that scale out. Recently, much attention has been given to MapReduce for parallel handling of massive data sets in cloud environments. Hive is the most widely used RDBMS-like system for DWs on MapReduce and provides scalable analytics. It is, however, challenging to do proper dimensional ETL processing with Hive; e.g., the concept of slowly changing dimensions (SCDs) is not supported (and due to lacking support for UPDATEs, SCDs are complex to handle manually). Also the powerful Pig platform for data processing on MapReduce does not support such dimensional ETL processing. To remedy this, we present the ETL framework CloudETL which uses Hadoop to parallelize ETL execution and to process data into Hive. The user defines the ETL process by means of high-level constructs and transformations and does not have to worry about technical MapReduce details. CloudETL supports different dimensional concepts such as star schemas and SCDs. We present how CloudETL works and uses different performance optimizations including a purpose-specific data placement policy to co-locate data. Further, we present a performance study and compare with other cloud-enabled systems. The results show that CloudETL scales very well and outperforms the dimensional ETL capabilities of Hive both with respect to performance and programmer productivity. For example, Hive uses 3.9 times as long to load an SCD in an experiment and needs 112 statements while CloudETL only needs 4.

Original language	English
Title of host publication	Proceedings of the 18th International Database Engineering & Applications Symposium
Number of pages	12
Publisher	Association for Computing Machinery
Publication date	2014
Pages	195-206
ISBN (Electronic)	978-1-4503-2627-8
DOIs	https://doi.org/10.1145/2628194.2628249
Publication status	Published - 2014
Event	International Database Engineering & Applications Symposium - Instituto Superior de Engenharia do Porto, Porto, Portugal. Duration: 7 Jul 2014 → 9 Jul 2014. Conference number: 18

Conference

Conference	International Database Engineering & Applications Symposium
Number	18
Location	Instituto Superior de Engenharia do Porto
Country/Territory	Portugal
City	Porto
Period	07/07/2014 → 09/07/2014

Access to Document

10.1145/2628194.2628249


Cite this

@inproceedings{7bfcaeff13144300b0646dd1a280375e,

title = "CloudETL: Scalable Dimensional ETL for Hive",

abstract = "Extract-Transform-Load (ETL) programs process data into data warehouses (DWs). Rapidly growing data volumes demand systems that scale out. Recently, much attention has been given to MapReduce for parallel handling of massive data sets in cloud environments. Hive is the most widely used RDBMS-like system for DWs on MapReduce and provides scalable analytics. It is, however, challenging to do proper dimensional ETL processing with Hive; e.g., the concept of slowly changing dimensions (SCDs) is not supported (and due to lacking support for UPDATEs, SCDs are complex to handle manually). Also the powerful Pig platform for data processing on MapReduce does not support such dimensional ETL processing. To remedy this, we present the ETL framework CloudETL which uses Hadoop to parallelize ETL execution and to process data into Hive. The user defines the ETL process by means of high-level constructs and transformations and does not have to worry about technical MapReduce details. CloudETL supports different dimensional concepts such as star schemas and SCDs. We present how CloudETL works and uses different performance optimizations including a purpose-specific data placement policy to co-locate data. Further, we present a performance study and compare with other cloud-enabled systems. The results show that CloudETL scales very well and outperforms the dimensional ETL capabilities of Hive both with respect to performance and programmer productivity. For example, Hive uses 3.9 times as long to load an SCD in an experiment and needs 112 statements while CloudETL only needs 4.",

author = "Liu, Xiufeng and Thomsen, Christian and Pedersen, {Torben Bach}",

year = "2014",

doi = "10.1145/2628194.2628249",

language = "English",

pages = "195--206",

booktitle = "Proceedings of the 18th International Database Engineering \& Applications Symposium",

publisher = "Association for Computing Machinery",

address = "United States",

note = "International Database Engineering \& Applications Symposium, IDEAS ; Conference date: 07-07-2014 Through 09-07-2014",

}

CloudETL: Scalable Dimensional ETL for Hive. / Liu, Xiufeng; Thomsen, Christian; Pedersen, Torben Bach.
Proceedings of the 18th International Database Engineering & Applications Symposium. Association for Computing Machinery, 2014. p. 195-206.

TY - GEN

T1 - CloudETL

T2 - International Database Engineering & Applications Symposium

AU - Liu, Xiufeng

AU - Thomsen, Christian

AU - Pedersen, Torben Bach

N1 - Conference code: 18

PY - 2014

Y1 - 2014

N2 - Extract-Transform-Load (ETL) programs process data into data warehouses (DWs). Rapidly growing data volumes demand systems that scale out. Recently, much attention has been given to MapReduce for parallel handling of massive data sets in cloud environments. Hive is the most widely used RDBMS-like system for DWs on MapReduce and provides scalable analytics. It is, however, challenging to do proper dimensional ETL processing with Hive; e.g., the concept of slowly changing dimensions (SCDs) is not supported (and due to lacking support for UPDATEs, SCDs are complex to handle manually). Also the powerful Pig platform for data processing on MapReduce does not support such dimensional ETL processing. To remedy this, we present the ETL framework CloudETL which uses Hadoop to parallelize ETL execution and to process data into Hive. The user defines the ETL process by means of high-level constructs and transformations and does not have to worry about technical MapReduce details. CloudETL supports different dimensional concepts such as star schemas and SCDs. We present how CloudETL works and uses different performance optimizations including a purpose-specific data placement policy to co-locate data. Further, we present a performance study and compare with other cloud-enabled systems. The results show that CloudETL scales very well and outperforms the dimensional ETL capabilities of Hive both with respect to performance and programmer productivity. For example, Hive uses 3.9 times as long to load an SCD in an experiment and needs 112 statements while CloudETL only needs 4.

AB - Extract-Transform-Load (ETL) programs process data into data warehouses (DWs). Rapidly growing data volumes demand systems that scale out. Recently, much attention has been given to MapReduce for parallel handling of massive data sets in cloud environments. Hive is the most widely used RDBMS-like system for DWs on MapReduce and provides scalable analytics. It is, however, challenging to do proper dimensional ETL processing with Hive; e.g., the concept of slowly changing dimensions (SCDs) is not supported (and due to lacking support for UPDATEs, SCDs are complex to handle manually). Also the powerful Pig platform for data processing on MapReduce does not support such dimensional ETL processing. To remedy this, we present the ETL framework CloudETL which uses Hadoop to parallelize ETL execution and to process data into Hive. The user defines the ETL process by means of high-level constructs and transformations and does not have to worry about technical MapReduce details. CloudETL supports different dimensional concepts such as star schemas and SCDs. We present how CloudETL works and uses different performance optimizations including a purpose-specific data placement policy to co-locate data. Further, we present a performance study and compare with other cloud-enabled systems. The results show that CloudETL scales very well and outperforms the dimensional ETL capabilities of Hive both with respect to performance and programmer productivity. For example, Hive uses 3.9 times as long to load an SCD in an experiment and needs 112 statements while CloudETL only needs 4.

U2 - 10.1145/2628194.2628249

DO - 10.1145/2628194.2628249

M3 - Article in proceeding

SP - 195

EP - 206

BT - Proceedings of the 18th International Database Engineering & Applications Symposium

PB - Association for Computing Machinery

Y2 - 7 July 2014 through 9 July 2014

ER -
