ETLMR: A Highly Scalable Dimensional ETL Framework Based on MapReduce

Xiufeng Liu; Christian Thomsen; Torben Bach Pedersen

doi:10.1007/978-3-642-23544-3_8

Research output: Contribution to journal › Conference article in Journal › Research › peer-review

23 Citations (Scopus)

1025 Downloads (Pure)

Abstract

Extract-Transform-Load (ETL) flows periodically populate data warehouses (DWs) with data from different source systems. An increasing challenge for ETL flows is processing huge volumes of data quickly. MapReduce is establishing itself as the de-facto standard for large-scale data-intensive processing. However, MapReduce lacks support for high-level ETL specific constructs, resulting in low ETL programmer productivity. This paper presents a scalable ETL framework, ETLMR, based on MapReduce. ETLMR has built-in native support for operations on DW-specific constructs such as star schemas, snowflake schemas and slowly changing dimensions (SCDs). This enables ETL developers to construct scalable MapReduce-based ETL flows with very few code lines. To achieve good performance and load balancing, a number of dimension and fact processing schemes are presented, including techniques for efficiently processing different types of dimensions. The paper describes the integration of ETLMR with a MapReduce framework and evaluates its performance on large realistic data sets. The experimental results show that ETLMR achieves very good scalability and compares favourably with other MapReduce data warehousing tools.

Original language	English
Book series	Lecture Notes in Computer Science
Volume	6862
Pages (from-to)	96-111
ISSN	0302-9743
DOIs	https://doi.org/10.1007/978-3-642-23544-3_8
Publication status	Published - Sept 2011
Event	13th International Conference on Data Warehousing and Knowledge Discovery, Toulouse, France (29 Aug 2011 → 2 Sept 2011; conference number 13)

Conference

Conference	13th International Conference on Data Warehousing and Knowledge Discovery
Number	13
Country/Territory	France
City	Toulouse
Period	29/08/2011 → 02/09/2011

Access to Document

10.1007/978-3-642-23544-3_8

Etlmr: Submitted manuscript, 459 KB

http://www.springerlink.com/content/gq6w36413350588t/

Cite this

@inproceedings{e7a2c404ba7c48baa8848b49aec4db00,
  title = "ETLMR: A Highly Scalable Dimensional ETL Framework Based on MapReduce",
  abstract = "Extract-Transform-Load (ETL) flows periodically populate data warehouses (DWs) with data from different source systems. An increasing challenge for ETL flows is processing huge volumes of data quickly. MapReduce is establishing itself as the de-facto standard for large-scale data-intensive processing. However, MapReduce lacks support for high-level ETL specific constructs, resulting in low ETL programmer productivity. This paper presents a scalable ETL framework, ETLMR, based on MapReduce. ETLMR has built-in native support for operations on DW-specific constructs such as star schemas, snowflake schemas and slowly changing dimensions (SCDs). This enables ETL developers to construct scalable MapReduce-based ETL flows with very few code lines. To achieve good performance and load balancing, a number of dimension and fact processing schemes are presented, including techniques for efficiently processing different types of dimensions. The paper describes the integration of ETLMR with a MapReduce framework and evaluates its performance on large realistic data sets. The experimental results show that ETLMR achieves very good scalability and compares favourably with other MapReduce data warehousing tools.",
  author = "Xiufeng Liu and Christian Thomsen and Pedersen, {Torben Bach}",
  year = "2011",
  month = sep,
  doi = "10.1007/978-3-642-23544-3_8",
  language = "English",
  volume = "6862",
  pages = "96--111",
  journal = "Lecture Notes in Computer Science",
  issn = "0302-9743",
  publisher = "Physica-Verlag",
  note = "13th International Conference on Data Warehousing and Knowledge Discovery ; Conference date: 29-08-2011 Through 02-09-2011",
}

TY  - GEN
T1  - ETLMR
T2  - 13th International Conference on Data Warehousing and Knowledge Discovery
AU  - Liu, Xiufeng
AU  - Thomsen, Christian
AU  - Pedersen, Torben Bach
N1  - Conference code: 13
PY  - 2011/9
Y1  - 2011/9
N2  - Extract-Transform-Load (ETL) flows periodically populate data warehouses (DWs) with data from different source systems. An increasing challenge for ETL flows is processing huge volumes of data quickly. MapReduce is establishing itself as the de-facto standard for large-scale data-intensive processing. However, MapReduce lacks support for high-level ETL specific constructs, resulting in low ETL programmer productivity. This paper presents a scalable ETL framework, ETLMR, based on MapReduce. ETLMR has built-in native support for operations on DW-specific constructs such as star schemas, snowflake schemas and slowly changing dimensions (SCDs). This enables ETL developers to construct scalable MapReduce-based ETL flows with very few code lines. To achieve good performance and load balancing, a number of dimension and fact processing schemes are presented, including techniques for efficiently processing different types of dimensions. The paper describes the integration of ETLMR with a MapReduce framework and evaluates its performance on large realistic data sets. The experimental results show that ETLMR achieves very good scalability and compares favourably with other MapReduce data warehousing tools.
AB  - Extract-Transform-Load (ETL) flows periodically populate data warehouses (DWs) with data from different source systems. An increasing challenge for ETL flows is processing huge volumes of data quickly. MapReduce is establishing itself as the de-facto standard for large-scale data-intensive processing. However, MapReduce lacks support for high-level ETL specific constructs, resulting in low ETL programmer productivity. This paper presents a scalable ETL framework, ETLMR, based on MapReduce. ETLMR has built-in native support for operations on DW-specific constructs such as star schemas, snowflake schemas and slowly changing dimensions (SCDs). This enables ETL developers to construct scalable MapReduce-based ETL flows with very few code lines. To achieve good performance and load balancing, a number of dimension and fact processing schemes are presented, including techniques for efficiently processing different types of dimensions. The paper describes the integration of ETLMR with a MapReduce framework and evaluates its performance on large realistic data sets. The experimental results show that ETLMR achieves very good scalability and compares favourably with other MapReduce data warehousing tools.
U2  - 10.1007/978-3-642-23544-3_8
DO  - 10.1007/978-3-642-23544-3_8
M3  - Conference article in Journal
SN  - 0302-9743
VL  - 6862
SP  - 96
EP  - 111
JO  - Lecture Notes in Computer Science
JF  - Lecture Notes in Computer Science
Y2  - 29 August 2011 through 2 September 2011
ER  - 