ETLMR: A Highly Scalable Dimensional ETL Framework based on MapReduce

Publication: Contribution to book/anthology/report/conference proceeding › Contribution to report › Research

19 Citations (Scopus)

Abstract

Extract-Transform-Load (ETL) flows periodically populate data warehouses (DWs) with data from different source systems. An increasing challenge for ETL flows is processing huge volumes of data quickly. MapReduce is establishing itself as the de facto standard for large-scale data-intensive processing. However, MapReduce lacks support for high-level ETL-specific constructs, resulting in low ETL programmer productivity. This report presents a scalable dimensional ETL framework, ETLMR, based on MapReduce. ETLMR has built-in native support for operations on DW-specific constructs such as star schemas, snowflake schemas and slowly changing dimensions (SCDs). This enables ETL developers to construct scalable MapReduce-based ETL flows with very few code lines. To achieve good performance and load balancing, a number of dimension and fact processing schemes are presented, including techniques for efficiently processing different types of dimensions. The report describes the integration of ETLMR with a MapReduce framework and evaluates its performance on large realistic data sets. The experimental results show that ETLMR achieves very good scalability and compares favourably with other MapReduce data warehousing tools.
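
To make the idea of processing a slowly changing dimension in a MapReduce style more concrete, the following is a minimal, self-contained sketch in plain Python. It is not ETLMR's API: the function names (`map_rows`, `shuffle`, `reduce_scd2`, `load_dimension`), the column names (`url`, `size`, `date`) and the in-memory shuffle are all assumptions made for illustration. It shows one common approach, a type-2 SCD, where each change to an attribute creates a new dimension row with its own version number and validity interval.

```python
from collections import defaultdict
from itertools import count


def map_rows(rows, key):
    """Map phase: emit (business key, row) pairs."""
    for row in rows:
        yield row[key], row


def shuffle(pairs):
    """Group mapped pairs by key, standing in for the MapReduce runtime."""
    groups = defaultdict(list)
    for k, row in pairs:
        groups[k].append(row)
    return groups


def reduce_scd2(rows, attrs, date_field, next_id):
    """Reduce phase: turn all rows of one business key into type-2 versions."""
    versions = []
    for row in sorted(rows, key=lambda r: r[date_field]):
        current = versions[-1] if versions else None
        if current is not None and all(current[a] == row[a] for a in attrs):
            continue  # no tracked attribute changed; keep the current version
        if current is not None:
            current["validto"] = row[date_field]  # close the previous version
        new = {a: row[a] for a in attrs}
        new.update(id=next(next_id), version=len(versions) + 1,
                   validfrom=row[date_field], validto=None)
        versions.append(new)
    return versions


def load_dimension(rows, key, attrs, date_field):
    """Run map, shuffle and reduce sequentially and collect dimension rows."""
    # Assumption: a single shared counter assigns surrogate keys. A real
    # parallel run would need a coordinated key generator instead.
    next_id = count(1)
    groups = shuffle(map_rows(rows, key))
    dim = []
    for k in sorted(groups):
        for version in reduce_scd2(groups[k], attrs, date_field, next_id):
            dim.append({key: k, **version})
    return dim


if __name__ == "__main__":
    source = [
        {"url": "a.dk", "size": 10, "date": "2011-01-01"},
        {"url": "a.dk", "size": 12, "date": "2011-02-01"},
        {"url": "b.dk", "size": 5, "date": "2011-01-01"},
    ]
    for dim_row in load_dimension(source, "url", ["size"], "date"):
        print(dim_row)
```

Grouping by business key means every version of a dimension member is handled by a single reducer, so versions can be ordered and closed without coordination between reducers. Only surrogate key assignment needs shared state, which this sketch simplifies to a local counter.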
Original language: English
Title: English
Number of pages: 25
Place of publication: Tech Report TR-29
Publisher: Department of Computer Science, Aalborg University
Publication date: 1 Aug. 2011
Status: Published - 1 Aug. 2011

Fingerprint

Mathematical transformations
Data warehouses
Processing
Resource allocation
Stars
Scalability
Productivity

Cite this

Xiufeng, L., Thomsen, C., & Pedersen, T. B. (2011). ETLMR: A Highly Scalable Dimensional ETL Framework based on MapReduce. In English Tech Report TR-29: Department of Computer Science, Aalborg University.
Xiufeng, Liu ; Thomsen, Christian ; Pedersen, Torben Bach. / ETLMR: A Highly Scalable Dimensional ETL Framework based on MapReduce. English. Tech Report TR-29 : Department of Computer Science, Aalborg University, 2011.
@inbook{2dea9bf5e3484c34a6b458d8ee28faf4,
title = "ETLMR: A Highly Scalable Dimensional ETL Framework based on MapReduce",
abstract = "Extract-Transform-Load (ETL) flows periodically populate data warehouses (DWs) with data from different source systems. An increasing challenge for ETL flows is processing huge volumes of data quickly. MapReduce is establishing itself as the de facto standard for large-scale data-intensive processing. However, MapReduce lacks support for high-level ETL-specific constructs, resulting in low ETL programmer productivity. This report presents a scalable dimensional ETL framework, ETLMR, based on MapReduce. ETLMR has built-in native support for operations on DW-specific constructs such as star schemas, snowflake schemas and slowly changing dimensions (SCDs). This enables ETL developers to construct scalable MapReduce-based ETL flows with very few code lines. To achieve good performance and load balancing, a number of dimension and fact processing schemes are presented, including techniques for efficiently processing different types of dimensions. The report describes the integration of ETLMR with a MapReduce framework and evaluates its performance on large realistic data sets. The experimental results show that ETLMR achieves very good scalability and compares favourably with other MapReduce data warehousing tools.",
author = "Liu Xiufeng and Christian Thomsen and Pedersen, {Torben Bach}",
note = "Technical Report",
year = "2011",
month = "8",
day = "1",
language = "English",
booktitle = "English",
publisher = "Department of Computer Science, Aalborg University",

}

Xiufeng, L, Thomsen, C & Pedersen, TB 2011, ETLMR: A Highly Scalable Dimensional ETL Framework based on MapReduce. in English. Department of Computer Science, Aalborg University, Tech Report TR-29.

ETLMR: A Highly Scalable Dimensional ETL Framework based on MapReduce. / Xiufeng, Liu; Thomsen, Christian; Pedersen, Torben Bach.

English. Tech Report TR-29 : Department of Computer Science, Aalborg University, 2011.

TY - GEN

T1 - ETLMR: A Highly Scalable Dimensional ETL Framework based on MapReduce

AU - Xiufeng, Liu

AU - Thomsen, Christian

AU - Pedersen, Torben Bach

N1 - Technical Report

PY - 2011/8/1

Y1 - 2011/8/1

AB - Extract-Transform-Load (ETL) flows periodically populate data warehouses (DWs) with data from different source systems. An increasing challenge for ETL flows is processing huge volumes of data quickly. MapReduce is establishing itself as the de facto standard for large-scale data-intensive processing. However, MapReduce lacks support for high-level ETL-specific constructs, resulting in low ETL programmer productivity. This report presents a scalable dimensional ETL framework, ETLMR, based on MapReduce. ETLMR has built-in native support for operations on DW-specific constructs such as star schemas, snowflake schemas and slowly changing dimensions (SCDs). This enables ETL developers to construct scalable MapReduce-based ETL flows with very few code lines. To achieve good performance and load balancing, a number of dimension and fact processing schemes are presented, including techniques for efficiently processing different types of dimensions. The report describes the integration of ETLMR with a MapReduce framework and evaluates its performance on large realistic data sets. The experimental results show that ETLMR achieves very good scalability and compares favourably with other MapReduce data warehousing tools.

M3 - Report chapter

BT - English

PB - Department of Computer Science, Aalborg University

CY - Tech Report TR-29

ER -

Xiufeng L, Thomsen C, Pedersen TB. ETLMR: A Highly Scalable Dimensional ETL Framework based on MapReduce. In English. Tech Report TR-29: Department of Computer Science, Aalborg University. 2011