ETLMR

A Highly Scalable Dimensional ETL Framework based on MapReduce

Publication: Contribution to book/anthology/report/conference proceeding; Contribution to book/anthology; Research; Peer-reviewed

7 Citations (Scopus)

Abstract

Extract-Transform-Load (ETL) flows periodically populate data warehouses (DWs) with data from different source systems. An increasing challenge for ETL flows is to process huge volumes of data quickly. MapReduce is establishing itself as the de facto standard for large-scale data-intensive processing. However, MapReduce lacks support for high-level ETL-specific constructs, resulting in low ETL programmer productivity. This paper presents a scalable dimensional ETL framework, ETLMR, based on MapReduce. ETLMR has built-in native support for operations on DW-specific constructs such as star schemas, snowflake schemas and slowly changing dimensions (SCDs). This enables ETL developers to construct scalable MapReduce-based ETL flows with very few lines of code. To achieve good performance and load balancing, a number of dimension and fact processing schemes are presented, including techniques for efficiently processing different types of dimensions. The paper describes the integration of ETLMR with a MapReduce framework and evaluates its performance on large realistic data sets. The experimental results show that ETLMR achieves very good scalability and compares favourably with other MapReduce data warehousing tools.
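To make the idea of processing a slowly changing dimension with map and reduce steps more concrete, the following is a minimal single-process sketch. It is not ETLMR's API: the function names (`map_phase`, `shuffle`, `reduce_scd2`, `load_dimension`), the column names (`url`, `size`, `date`) and the sequential surrogate-key counter are all assumptions made for illustration. A real parallel run would need to coordinate key assignment across reducers.

```python
from collections import defaultdict
from itertools import count


def map_phase(rows, business_key):
    """Map: emit (business key, row) pairs so that all source rows of one
    dimension member end up in the same reduce group."""
    for row in rows:
        yield row[business_key], row


def shuffle(pairs):
    """Group mapped pairs by key, standing in for the MapReduce runtime."""
    groups = defaultdict(list)
    for key, row in pairs:
        groups[key].append(row)
    return groups


def reduce_scd2(key, rows, next_id, tracked):
    """Reduce: turn the source rows of one member into SCD type-2 versions.
    A new version is opened whenever a tracked attribute changes."""
    versions = []
    for row in sorted(rows, key=lambda r: r["date"]):
        current = versions[-1] if versions else None
        if current and all(current[a] == row[a] for a in tracked):
            continue  # nothing changed: keep the current version open
        if current:
            current["validto"] = row["date"]  # close the previous version
        versions.append({
            "id": next(next_id),            # hypothetical surrogate key
            "key": key,
            **{a: row[a] for a in tracked},
            "validfrom": row["date"],
            "validto": None,                # None marks the open version
            "version": len(versions) + 1,
        })
    return versions


def load_dimension(rows, business_key="url", tracked=("size",)):
    """Run map, shuffle and reduce sequentially and return dimension rows."""
    ids = count(1)
    groups = shuffle(map_phase(rows, business_key))
    dim = []
    for key in sorted(groups):
        dim.extend(reduce_scd2(key, groups[key], ids, tracked))
    return dim
```

Grouping by business key in the map step is what lets each reducer see the full history of one dimension member, so versioning decisions can be made locally within a group.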
Original language: English
Title: Transactions on Large-Scale Data- and Knowledge-Centered Systems VIII: Special Issue on Advances in Data Warehousing and Knowledge Discovery
Number of pages: 31
Volume: 7790
Publisher: Springer VS
Publication date: 2013
Pages: 1-31
ISBN (Print): 978-3-642-37573-6
ISBN (Electronic): 978-3-642-37574-3
DOI: https://doi.org/10.1007/978-3-642-37574-3_1
Status: Published - 2013
Series name: Transactions on Large-Scale Data- and Knowledge-Centered Systems
ISSN: 1869-1994
Series name: Lecture Notes in Computer Science
Volume: 7790
ISSN: 0302-9743

Fingerprint

Mathematical transformations
Data warehouses
Processing
Resource allocation
Stars
Scalability
Productivity

Cite this

Xiufeng, L., Thomsen, C., & Pedersen, T. B. (2013). ETLMR: A Highly Scalable Dimensional ETL Framework based on MapReduce. In Transactions on Large-Scale Data- and Knowledge-Centered Systems VIII: Special Issue on Advances in Data Warehousing and Knowledge Discovery (Vol. 7790, pp. 1-31). Springer VS. Transactions on Large-Scale Data- and Knowledge-Centered Systems, Lecture Notes in Computer Science, Vol. 7790 https://doi.org/10.1007/978-3-642-37574-3_1
Xiufeng, Liu ; Thomsen, Christian ; Pedersen, Torben Bach. / ETLMR : A Highly Scalable Dimensional ETL Framework based on MapReduce. Transactions on Large-Scale Data- and Knowledge-Centered Systems VIII: Special Issue on Advances in Data Warehousing and Knowledge Discovery. Vol. 7790 Springer VS, 2013. pp. 1-31 (Transactions on Large-Scale Data- and Knowledge-Centered Systems). (Lecture Notes in Computer Science, Vol. 7790).
@inbook{cbdecc8884be4c00a198b6f84e380f92,
title = "ETLMR: A Highly Scalable Dimensional ETL Framework based on MapReduce",
abstract = "Abstract. Extract-Transform-Load (ETL) flows periodically populate data warehouses (DWs) with data from different source systems. An increasing challenge for ETL flows is to process huge volumes of data quickly. MapReduce is establishing itself as the de-facto standard for large-scale data-intensive processing. However, MapReduce lacks support for high-level ETL specific constructs, resulting in low ETL programmer productivity. This paper presents a scalable dimensional ETL framework, ETLMR, based on MapReduce. ETLMR has built-in native support for operations on DW-specific constructs such as star schemas, snowflake schemas and slowly changing dimensions (SCDs). This enables ETL developers to construct scalable MapReduce-based ETL flows with very few code lines. To achieve good performance and load balancing, a number of dimension and fact processing schemes are presented, including techniques for efficiently processing different types of dimensions. The paper describes the integration of ETLMR with a MapReduce framework and evaluates its performance on large realistic data sets. The experimental results show that ETLMR achieves very good scalability and compares favourably with other MapReduce data warehousing tools.",
author = "Liu Xiufeng and Christian Thomsen and Pedersen, {Torben Bach}",
note = "The original publication is available at www.springerlink.com",
year = "2013",
doi = "10.1007/978-3-642-37574-3_1",
language = "English",
isbn = "978-3-642-37573-6",
volume = "7790",
pages = "1--31",
booktitle = "Transactions on Large-Scale Data- and Knowledge-Centered Systems VIII",
publisher = "Springer VS",

}

Xiufeng, L, Thomsen, C & Pedersen, TB 2013, ETLMR: A Highly Scalable Dimensional ETL Framework based on MapReduce. in Transactions on Large-Scale Data- and Knowledge-Centered Systems VIII: Special Issue on Advances in Data Warehousing and Knowledge Discovery. vol. 7790, Springer VS, Transactions on Large-Scale Data- and Knowledge-Centered Systems, Lecture Notes in Computer Science, vol. 7790, pp. 1-31. https://doi.org/10.1007/978-3-642-37574-3_1

TY - CHAP

T1 - ETLMR

T2 - A Highly Scalable Dimensional ETL Framework based on MapReduce

AU - Xiufeng, Liu

AU - Thomsen, Christian

AU - Pedersen, Torben Bach

N1 - The original publication is available at www.springerlink.com

PY - 2013

Y1 - 2013

AB - Abstract. Extract-Transform-Load (ETL) flows periodically populate data warehouses (DWs) with data from different source systems. An increasing challenge for ETL flows is to process huge volumes of data quickly. MapReduce is establishing itself as the de-facto standard for large-scale data-intensive processing. However, MapReduce lacks support for high-level ETL specific constructs, resulting in low ETL programmer productivity. This paper presents a scalable dimensional ETL framework, ETLMR, based on MapReduce. ETLMR has built-in native support for operations on DW-specific constructs such as star schemas, snowflake schemas and slowly changing dimensions (SCDs). This enables ETL developers to construct scalable MapReduce-based ETL flows with very few code lines. To achieve good performance and load balancing, a number of dimension and fact processing schemes are presented, including techniques for efficiently processing different types of dimensions. The paper describes the integration of ETLMR with a MapReduce framework and evaluates its performance on large realistic data sets. The experimental results show that ETLMR achieves very good scalability and compares favourably with other MapReduce data warehousing tools.

U2 - 10.1007/978-3-642-37574-3_1

DO - 10.1007/978-3-642-37574-3_1

M3 - Book chapter

SN - 978-3-642-37573-6

VL - 7790

SP - 1

EP - 31

BT - Transactions on Large-Scale Data- and Knowledge-Centered Systems VIII

PB - Springer VS

ER -

Xiufeng L, Thomsen C, Pedersen TB. ETLMR: A Highly Scalable Dimensional ETL Framework based on MapReduce. In Transactions on Large-Scale Data- and Knowledge-Centered Systems VIII: Special Issue on Advances in Data Warehousing and Knowledge Discovery. Vol. 7790. Springer VS. 2013. pp. 1-31. (Transactions on Large-Scale Data- and Knowledge-Centered Systems). (Lecture Notes in Computer Science, Vol. 7790). https://doi.org/10.1007/978-3-642-37574-3_1