ETLMR: A Highly Scalable Dimensional ETL Framework based on MapReduce

Research output: Contribution to book/anthology/report/conference proceeding; Book chapter; Research; peer-reviewed

17 Citations (Scopus)

Abstract

Extract-Transform-Load (ETL) flows periodically populate data warehouses (DWs) with data from different source systems. An increasing challenge for ETL flows is to process huge volumes of data quickly. MapReduce is establishing itself as the de facto standard for large-scale data-intensive processing. However, MapReduce lacks support for high-level ETL-specific constructs, resulting in low ETL programmer productivity. This paper presents a scalable dimensional ETL framework, ETLMR, based on MapReduce. ETLMR has built-in native support for operations on DW-specific constructs such as star schemas, snowflake schemas and slowly changing dimensions (SCDs). This enables ETL developers to construct scalable MapReduce-based ETL flows with very few lines of code. To achieve good performance and load balancing, a number of dimension and fact processing schemes are presented, including techniques for efficiently processing different types of dimensions. The paper describes the integration of ETLMR with a MapReduce framework and evaluates its performance on large realistic data sets. The experimental results show that ETLMR achieves very good scalability and compares favourably with other MapReduce data warehousing tools.
Original language: English
Title of host publication: Transactions on Large-Scale Data- and Knowledge-Centered Systems VIII: Special Issue on Advances in Data Warehousing and Knowledge Discovery
Number of pages: 31
Volume: 7790
Publisher: Springer VS
Publication date: 2013
Pages: 1-31
ISBN (Print): 978-3-642-37573-6
ISBN (Electronic): 978-3-642-37574-3
Publication status: Published - 2013
Series: Transactions on Large-Scale Data- and Knowledge-Centered Systems
ISSN: 1869-1994
Series: Lecture Notes in Computer Science
Volume: 7790
ISSN: 0302-9743

Bibliographical note

The original publication is available at www.springerlink.com
