ETLMR: A Highly Scalable Dimensional ETL Framework based on MapReduce

Research output: Contribution to book/anthology/report/conference proceeding; Report chapter; Research

19 Citations (Scopus)

Abstract

Extract-Transform-Load (ETL) flows periodically populate data warehouses (DWs) with data from different source systems. An increasing challenge for ETL flows is processing huge volumes of data quickly. MapReduce is establishing itself as the de-facto standard for large-scale data-intensive processing. However, MapReduce lacks support for high-level ETL specific constructs, resulting in low ETL programmer productivity. This report presents a scalable dimensional ETL framework, ETLMR, based on MapReduce. ETLMR has built-in native support for operations on DW-specific constructs such as star schemas, snowflake schemas and slowly changing dimensions (SCDs). This enables ETL developers to construct scalable MapReduce-based ETL flows with very few code lines. To achieve good performance and load balancing, a number of dimension and fact processing schemes are presented, including techniques for efficiently processing different types of dimensions. The report describes the integration of ETLMR with a MapReduce framework and evaluates its performance on large realistic data sets. The experimental results show that ETLMR achieves very good scalability and compares favourably with other MapReduce data warehousing tools.
Original language: English
Title of host publication: English
Number of pages: 25
Place of Publication: Tech Report TR-29
Publisher: Department of Computer Science, Aalborg University
Publication date: 1 Aug 2011
Publication status: Published - 1 Aug 2011

Bibliographical note

Technical Report

