Data warehousing technologies for large-scale and right-time data

Publikation: ForskningPh.d.-afhandling

Abstrakt

This thesis is about data warehousing technologies for large-scale and right-time data. Today, due to the exponential growth of data, it has become a common practice for many enterprises to process hundreds of gigabytes of data per day. Traditionally, data warehousing populates data from heterogeneous sources into a central data warehouse (DW) by Extract-Transform-Load (ETL) at regular time intervals, e.g., monthly, weekly, or daily. But now, it becomes challenging for large-scale data, and hard to meet the near real-time/right-time business decisions. This thesis considers some of these challenges and makes the following contributions:

First, this thesis presents a new and efficient way to store triples from an OWL Lite ontology known from the Semantic Web field. In contrast to classic triple-stores where the data with the triple format of (subject; predicate; object) is stored in few, but big, tables with few columns, the presented triple-store spreads the data over more tables that may have many columns. The triple-store is optimized by an extensive use of bulk techniques, which makes it very efficient to insert and extract data. The DBMS-based solution makes it very flexible to integrate with other non-triple data.

Second, this thesis presents a middle-ware system for live DW data. Processing live DW data is one of the tricky problems in data warehousing. An innovative method is proposed for processing live DW data, which accumulates data in an intermediate store, then does data modifications on-the-fly when the data is materialized or queried. The data is made available in the DW exactly when needed and users can get bulk-load speeds, but INSERT-like data availability.

Third, this thesis presents the first dimensional ETL programming framework using MapReduce. Parallel ETL is needed for large-scale data, but it is not easy to implement. This presented framework makes this very easy by using the high-level ETL-specific constructs, including those for star schema, snowflake schema, slowly changing dimensions (SCDs) and very large dimensions. The framework can achieve high programming efficiency, i.e., only a few statements needed for implementing a parallel ETL program, and good scalability for processing different DW schemas.

Finally, this thesis presents scalable dimensional ETL for cloud warehouse. Today, organizations gain growing interest in moving data warehousing systems towards the cloud, however, the current data warehousing systems are not yet particularly suited for the cloud. The presented framework exploits Hadoop to parallelize ETL execution and Hive as the warehouse system. It has a shared-nothing architecture, and supports scalable ETL operations on clustered commodity machines. To implement dimensional ETL, this framework can achieve higher programmer productivity than Hive, and its performance is also better.

In summary, this thesis discusses several aspects of the current challenges and problems of data warehousing, including integrating Web data, near real-time/right time data warehousing, handling the exponential growth of data and cloud data warehousing. This thesis proposes a variety of technologies to deal with these specific
issues.
Luk

Detaljer

This thesis is about data warehousing technologies for large-scale and right-time data. Today, due to the exponential growth of data, it has become a common practice for many enterprises to process hundreds of gigabytes of data per day. Traditionally, data warehousing populates data from heterogeneous sources into a central data warehouse (DW) by Extract-Transform-Load (ETL) at regular time intervals, e.g., monthly, weekly, or daily. But now, it becomes challenging for large-scale data, and hard to meet the near real-time/right-time business decisions. This thesis considers some of these challenges and makes the following contributions:

First, this thesis presents a new and efficient way to store triples from an OWL Lite ontology known from the Semantic Web field. In contrast to classic triple-stores where the data with the triple format of (subject; predicate; object) is stored in few, but big, tables with few columns, the presented triple-store spreads the data over more tables that may have many columns. The triple-store is optimized by an extensive use of bulk techniques, which makes it very efficient to insert and extract data. The DBMS-based solution makes it very flexible to integrate with other non-triple data.

Second, this thesis presents a middle-ware system for live DW data. Processing live DW data is one of the tricky problems in data warehousing. An innovative method is proposed for processing live DW data, which accumulates data in an intermediate store, then does data modifications on-the-fly when the data is materialized or queried. The data is made available in the DW exactly when needed and users can get bulk-load speeds, but INSERT-like data availability.

Third, this thesis presents the first dimensional ETL programming framework using MapReduce. Parallel ETL is needed for large-scale data, but it is not easy to implement. This presented framework makes this very easy by using the high-level ETL-specific constructs, including those for star schema, snowflake schema, slowly changing dimensions (SCDs) and very large dimensions. The framework can achieve high programming efficiency, i.e., only a few statements needed for implementing a parallel ETL program, and good scalability for processing different DW schemas.

Finally, this thesis presents scalable dimensional ETL for cloud warehouse. Today, organizations gain growing interest in moving data warehousing systems towards the cloud, however, the current data warehousing systems are not yet particularly suited for the cloud. The presented framework exploits Hadoop to parallelize ETL execution and Hive as the warehouse system. It has a shared-nothing architecture, and supports scalable ETL operations on clustered commodity machines. To implement dimensional ETL, this framework can achieve higher programmer productivity than Hive, and its performance is also better.

In summary, this thesis discusses several aspects of the current challenges and problems of data warehousing, including integrating Web data, near real-time/right time data warehousing, handling the exponential growth of data and cloud data warehousing. This thesis proposes a variety of technologies to deal with these specific
issues.
OriginalsprogEngelsk
UdgiverDepartment of Computer Science, Aalborg University
Antal sider196
StatusUdgivet - aug. 2012
SeriePh.D. Thesis
ISSN1601-0590

Download-statistik

Ingen data tilgængelig
ID: 66622195