CloudETL: Scalable Dimensional ETL for Hadoop and Hive

Publication: Book/Anthology/Thesis/Report › Report › Research

Abstract

Extract-Transform-Load (ETL) programs process data from sources into data warehouses (DWs). Due to the rapid growth of data volumes, there is an increasing demand for systems that can scale on demand. Recently, much attention has been given to MapReduce, which is a framework for highly parallel handling of massive data sets in cloud environments. The MapReduce-based Hive has been proposed as a DBMS-like system for DWs and provides good and scalable analytical features. It is, however, still challenging to do proper dimensional ETL processing with Hive; for example, UPDATEs are not supported, which makes handling of slowly changing dimensions (SCDs) very difficult. To remedy this, we here present the cloud-enabled ETL framework CloudETL. CloudETL uses the open-source MapReduce implementation Hadoop to parallelize the ETL execution and to process data into Hive. The user defines the ETL process by means of high-level constructs and transformations and does not have to worry about the technical details of MapReduce. CloudETL provides built-in support for different dimensional concepts, including star schemas, snowflake schemas, and SCDs. In the report, we present how CloudETL works. We present different performance optimizations, including a purpose-specific data placement policy for Hadoop to co-locate data. Further, we present a performance study using realistic data amounts and compare with other cloud-enabled systems. The results show that CloudETL has good scalability and outperforms the dimensional ETL capabilities of Hive both with respect to performance and programmer productivity.
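The abstract points out that Hive's lack of UPDATE support makes slowly changing dimensions hard to handle, which is the kind of dimensional processing CloudETL adds built-in support for. The report's own high-level constructs are not reproduced here; as a rough illustration only, the following minimal Python sketch shows the standard type-2 SCD pattern (expire the current dimension row, then append a new version) that an INSERT-only store cannot express directly. All field and function names below are hypothetical, not CloudETL's actual API.

# Minimal sketch (hypothetical names, not CloudETL's actual API): type-2 SCD
# handling expires the current dimension row and appends a new version.
# The "expire" step is an in-place UPDATE, which Hive (at the time of the
# report) could not perform.
from datetime import date

def apply_scd2(dim_rows, incoming, key, tracked, change_date):
    """Apply one source row to a type-2 dimension kept as a list of dicts."""
    for row in dim_rows:
        if row[key] == incoming[key] and row["valid_to"] is None:   # current version
            if all(row[a] == incoming[a] for a in tracked):
                return dim_rows                                      # nothing changed
            row["valid_to"] = change_date                            # expire it (an UPDATE)
            break
    new_row = dict(incoming, valid_from=change_date, valid_to=None)  # new current version
    dim_rows.append(new_row)                                         # a plain INSERT
    return dim_rows

# Example: a customer moves, so the old row is closed and a new version is added.
dim = [{"cust_id": 1, "city": "Aalborg", "valid_from": date(2010, 1, 1), "valid_to": None}]
dim = apply_scd2(dim, {"cust_id": 1, "city": "Copenhagen"}, "cust_id", ["city"], date(2012, 6, 1))

In CloudETL itself, the corresponding work is parallelized as Hadoop jobs that write the resulting dimension data into Hive; the sketch only conveys why per-row updates are needed for SCD handling.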
Original language: English
Publisher: Department of Computer Science, Aalborg University
Number of pages: 31
Status: Published - 2012
Series name: 1DB Technical Report
Volume: TR-31

Fingerprint

Mathematical transformations
Data warehouses
Stars
Scalability
Productivity
Processing

Cite this

Xiufeng, L., Thomsen, C., & Pedersen, T. B. (2012). CloudETL: Scalable Dimensional ETL for Hadoop and Hive. Department of Computer Science, Aalborg University. 1DB Technical Report, Vol. TR-31
Xiufeng, Liu; Thomsen, Christian; Pedersen, Torben Bach. / CloudETL: Scalable Dimensional ETL for Hadoop and Hive. Department of Computer Science, Aalborg University, 2012. 31 p. (1DB Technical Report, Vol. TR-31).
@book{e03af7954b7e418f8aa17d1b15f2b5ba,
title = "CloudETL: Scalable Dimensional ETL for Hadoop and Hive",
abstract = "Extract-Transform-Load (ETL) programs process data from sources into data warehouses (DWs). Due to the rapid growth of data volumes, there is an increasing demand for systems that can scale on demand. Recently, much attention has been given to MapReduce, which is a framework for highly parallel handling of massive data sets in cloud environments. The MapReduce-based Hive has been proposed as a DBMS-like system for DWs and provides good and scalable analytical features. It is, however, still challenging to do proper dimensional ETL processing with Hive; for example, UPDATEs are not supported, which makes handling of slowly changing dimensions (SCDs) very difficult. To remedy this, we here present the cloud-enabled ETL framework CloudETL. CloudETL uses the open-source MapReduce implementation Hadoop to parallelize the ETL execution and to process data into Hive. The user defines the ETL process by means of high-level constructs and transformations and does not have to worry about the technical details of MapReduce. CloudETL provides built-in support for different dimensional concepts, including star schemas, snowflake schemas, and SCDs. In the report, we present how CloudETL works. We present different performance optimizations, including a purpose-specific data placement policy for Hadoop to co-locate data. Further, we present a performance study using realistic data amounts and compare with other cloud-enabled systems. The results show that CloudETL has good scalability and outperforms the dimensional ETL capabilities of Hive both with respect to performance and programmer productivity.",
author = "Liu Xiufeng and Christian Thomsen and Pedersen, {Torben Bach}",
year = "2012",
language = "English",
series = "1DB Technical Report",
publisher = "Department of Computer Science, Aalborg University",

}

Xiufeng, L, Thomsen, C & Pedersen, TB 2012, CloudETL: Scalable Dimensional ETL for Hadoop and Hive. 1DB Technical Report, vol. TR-31, Department of Computer Science, Aalborg University.

CloudETL: Scalable Dimensional ETL for Hadoop and Hive. / Xiufeng, Liu; Thomsen, Christian; Pedersen, Torben Bach.

Department of Computer Science, Aalborg University, 2012. 31 p. (1DB Technical Report, Vol. TR-31).

Publication: Book/Anthology/Thesis/Report › Report › Research

TY - RPRT

T1 - CloudETL: Scalable Dimensional ETL for Hadoop and Hive

AU - Xiufeng, Liu

AU - Thomsen, Christian

AU - Pedersen, Torben Bach

PY - 2012

Y1 - 2012

N2 - Extract-Transform-Load (ETL) programs process data from sources into data warehouses (DWs). Due to the rapid growth of data volumes, there is an increasing demand for systems that can scale on demand. Recently, much attention has been given to MapReduce, which is a framework for highly parallel handling of massive data sets in cloud environments. The MapReduce-based Hive has been proposed as a DBMS-like system for DWs and provides good and scalable analytical features. It is, however, still challenging to do proper dimensional ETL processing with Hive; for example, UPDATEs are not supported, which makes handling of slowly changing dimensions (SCDs) very difficult. To remedy this, we here present the cloud-enabled ETL framework CloudETL. CloudETL uses the open-source MapReduce implementation Hadoop to parallelize the ETL execution and to process data into Hive. The user defines the ETL process by means of high-level constructs and transformations and does not have to worry about the technical details of MapReduce. CloudETL provides built-in support for different dimensional concepts, including star schemas, snowflake schemas, and SCDs. In the report, we present how CloudETL works. We present different performance optimizations, including a purpose-specific data placement policy for Hadoop to co-locate data. Further, we present a performance study using realistic data amounts and compare with other cloud-enabled systems. The results show that CloudETL has good scalability and outperforms the dimensional ETL capabilities of Hive both with respect to performance and programmer productivity.

AB - Extract-Transform-Load (ETL) programs process data from sources into data warehouses (DWs). Due to the rapid growth of data volumes, there is an increasing demand for systems that can scale on demand. Recently, much attention has been given to MapReduce, which is a framework for highly parallel handling of massive data sets in cloud environments. The MapReduce-based Hive has been proposed as a DBMS-like system for DWs and provides good and scalable analytical features. It is, however, still challenging to do proper dimensional ETL processing with Hive; for example, UPDATEs are not supported, which makes handling of slowly changing dimensions (SCDs) very difficult. To remedy this, we here present the cloud-enabled ETL framework CloudETL. CloudETL uses the open-source MapReduce implementation Hadoop to parallelize the ETL execution and to process data into Hive. The user defines the ETL process by means of high-level constructs and transformations and does not have to worry about the technical details of MapReduce. CloudETL provides built-in support for different dimensional concepts, including star schemas, snowflake schemas, and SCDs. In the report, we present how CloudETL works. We present different performance optimizations, including a purpose-specific data placement policy for Hadoop to co-locate data. Further, we present a performance study using realistic data amounts and compare with other cloud-enabled systems. The results show that CloudETL has good scalability and outperforms the dimensional ETL capabilities of Hive both with respect to performance and programmer productivity.

M3 - Report

T3 - 1DB Technical Report

BT - CloudETL: Scalable Dimensional ETL for Hadoop and Hive

PB - Department of Computer Science, Aalborg University

ER -

Xiufeng L, Thomsen C, Pedersen TB. CloudETL: Scalable Dimensional ETL for Hadoop and Hive. Department of Computer Science, Aalborg University, 2012. 31 p. (1DB Technical Report, Vol. TR-31).