Abstract
Extract-Transform-Load (ETL) programs process data into data warehouses (DWs). Rapidly growing data volumes demand systems that scale out. Recently, much attention has been given to MapReduce for the parallel handling of massive data sets in cloud environments. Hive is the most widely used RDBMS-like system for DWs on MapReduce and provides scalable analytics. It is, however, challenging to do proper dimensional ETL processing with Hive; e.g., the concept of slowly changing dimensions (SCDs) is not supported (and because UPDATEs are not supported, SCDs are complex to handle manually). Likewise, Pig, the powerful data processing platform on MapReduce, does not support such dimensional ETL processing. To remedy this, we present the ETL framework CloudETL, which uses Hadoop to parallelize ETL execution and to process data into Hive. The user defines the ETL process by means of high-level constructs and transformations and does not have to worry about technical MapReduce details. CloudETL supports different dimensional concepts such as star schemas and SCDs. We present how CloudETL works and how it uses different performance optimizations, including a purpose-specific data placement policy that co-locates data. Further, we present a performance study and compare CloudETL with other cloud-enabled systems. The results show that CloudETL scales very well and outperforms the dimensional ETL capabilities of Hive with respect to both performance and programmer productivity. For example, in one experiment, Hive takes 3.9 times as long to load an SCD and needs 112 statements, while CloudETL needs only 4.
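To see why SCDs are awkward without UPDATE support: in a type-2 SCD, a changed attribute value must expire the current dimension row (by setting its valid-to date) and insert a new row with an incremented version number. The sketch below shows this logic in Python; the field names (validfrom, validto, version) and the in-memory row layout are illustrative assumptions, not CloudETL's actual API. In Hive, the "expire the old version" step is precisely the row rewrite that the missing UPDATE support makes complex.

```python
from datetime import date

def apply_scd2_change(dim_rows, business_key, new_attrs, change_date):
    """Type-2 SCD change: expire the current row for the key and
    append a new version carrying the changed attribute values."""
    current = next(r for r in dim_rows
                   if r["key"] == business_key and r["validto"] is None)
    current["validto"] = change_date           # close the old version
    new_row = dict(current, **new_attrs)       # copy row, apply changes
    new_row["validfrom"] = change_date         # new version starts now
    new_row["validto"] = None                  # ... and is still open
    new_row["version"] = current["version"] + 1
    dim_rows.append(new_row)

# Example: a customer moves from Aalborg to Porto on 2014-07-07.
dim = [{"key": 1, "city": "Aalborg", "validfrom": date(2010, 1, 1),
        "validto": None, "version": 1}]
apply_scd2_change(dim, 1, {"city": "Porto"}, date(2014, 7, 7))
```

In a row-updatable RDBMS this is one UPDATE plus one INSERT; without UPDATEs, the table holding the old version must effectively be rewritten, which is what drives up the statement count reported for Hive above.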
Original language | English |
---|---|
Title of host publication | Proceedings of the 18th International Database Engineering & Applications Symposium |
Number of pages | 12 |
Publisher | Association for Computing Machinery (ACM) |
Publication date | 2014 |
Pages | 195-206 |
ISBN (Electronic) | 978-1-4503-2627-8 |
DOIs | |
Publication status | Published - 2014 |
Event | International Database Engineering & Applications Symposium - Instituto Superior de Engenharia do Porto, Porto, Portugal. Duration: 7 Jul 2014 → 9 Jul 2014. Conference number: 18 |
Conference
Conference | International Database Engineering & Applications Symposium |
---|---|
Number | 18 |
Location | Instituto Superior de Engenharia do Porto |
Country/Territory | Portugal |
City | Porto |
Period | 07/07/2014 → 09/07/2014 |