Aspects of Data Warehouse Technologies for Complex Web Data

Publication: Research - Ph.D. thesis

Abstract

This thesis is about aspects of the specification and development of data
warehouse technologies for complex web data. Today, large amounts of data
exist in different web resources and in different formats, but it is often
hard to analyze and query this big, complex data or data about the
data (i.e., metadata). It is therefore interesting to apply Data Warehouse
(DW) technology to the data. Applying DW technology to complex web data is,
however, not straightforward, and the DW community faces new and exciting
challenges. This thesis considers some of these challenges.


The work leading to this thesis has primarily been done in relation to the
project European Internet Accessibility Observatory (EIAO), where a data
warehouse for accessibility data (roughly, data about how usable web resources
are for disabled users) has been specified and developed. The results of the
thesis can, however, also be applied to other projects using business
intelligence (BI) and/or complex web data. An interesting perspective is that
all the technologies used and developed in the presented work are based on
open source software.

The thesis presents several tools in a survey of the possibilities for using
open source software for BI purposes. Each category of products is evaluated
against criteria relevant to the use of BI in industry. After this,
experiences from designing and implementing a DW for accessibility data are
presented, together with the conceptual, logical, and physical models for the
DW. This is believed to be the first time a general and scalable DW has been
built for the accessibility field, which is complex both to model and to
calculate aggregation results for.

The thesis then presents solutions to generally interesting problems
found during the work on developing a DW and supporting DW technologies for
the EIAO project. A new and efficient way to store triples from an OWL
ontology, known from the Semantic Web field, is presented. In contrast to
traditional triple stores, where the data is stored in a few big tables with
few columns, the presented solution spreads the data over more tables that may
have many columns. This makes it efficient to insert and extract data,
particularly when bulk loading large amounts of data.
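The contrast can be sketched with a small, hypothetical example (the table and column names are invented for illustration and are not the thesis's actual schema): a traditional triple store keeps one narrow table of (subject, predicate, object) rows, whereas a layout with a table per ontology class and a column per property turns each subject into a single wide row.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Traditional layout: one big, narrow table holding every triple.
con.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
con.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [("page1", "url", "http://example.org"), ("page1", "httpStatus", "200")],
)

# Alternative layout (hypothetical): one table per ontology class,
# one column per property, so each subject becomes one wide row.
con.execute(
    "CREATE TABLE WebPage (subject TEXT PRIMARY KEY, url TEXT, httpStatus INTEGER)"
)
con.execute("INSERT INTO WebPage VALUES (?, ?, ?)", ("page1", "http://example.org", 200))

# Extracting all properties of a subject is now a single-row lookup
# instead of a multi-row scan (or self-joins) over the triple table.
row = con.execute("SELECT url, httpStatus FROM WebPage WHERE subject = 'page1'").fetchone()
print(row)  # ('http://example.org', 200)
```

The wide-row layout also suits bulk loading, since all property values of a subject arrive in one row rather than being scattered over many triple rows.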

A new and flexible way to exchange relational data via the XML format (which
is, e.g., used by web services) is also presented. This method saves the labor
of programming the often complex solutions needed to handle correct exchange
of data. With the presented method, the user only has to specify what data to
export and the structure of the generated XML. The data can then automatically
be exported to XML and imported into another database, just as updates to the
XML can automatically be migrated back to the original database.
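The export direction of this idea can be illustrated with a simplified, hypothetical sketch (the function and its parameters are invented for illustration, not the thesis's actual interface): the user supplies only a query and the desired XML element names, and the tool shapes the result set accordingly.

```python
import sqlite3
import xml.etree.ElementTree as ET

def export_to_xml(con, query, root_tag, row_tag):
    """Export a query result as XML; column names become element names.
    (A simplified, hypothetical stand-in for the method described above.)"""
    cur = con.execute(query)
    cols = [d[0] for d in cur.description]
    root = ET.Element(root_tag)
    for row in cur:
        elem = ET.SubElement(root, row_tag)
        for col, val in zip(cols, row):
            ET.SubElement(elem, col).text = str(val)
    return ET.tostring(root, encoding="unicode")

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE site (id INTEGER, url TEXT)")
con.execute("INSERT INTO site VALUES (1, 'http://example.org')")
xml = export_to_xml(con, "SELECT id, url FROM site", "sites", "site")
print(xml)  # <sites><site><id>1</id><url>http://example.org</url></site></sites>
```

The import direction would invert the same specification: walk the XML, map element names back to columns, and insert the rows into the target database.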

Regression testing is widely accepted and used in traditional software
development. For Extract-Transform-Load (ETL) software, however, regression
testing is traditionally cumbersome and time-consuming. The thesis points out
crucial differences between testing "normal" software and testing ETL
software, and against that background a new semi-automatic framework for
regression testing of ETL software is introduced. The framework makes it easy
and fast to start doing regression testing; setting up a regression test with
the framework only takes minutes.
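One core idea behind such semi-automatic regression testing can be sketched as follows (a simplified illustration, not the thesis's actual framework): run the ETL flow on a fixed source, record the resulting DW state as a snapshot the first time, and on later runs compare the new state against the stored snapshot.

```python
import json
import os
import tempfile

def check_etl_result(rows, snapshot_file):
    """Compare an ETL run's output rows to a stored 'known good' snapshot.
    The first run records the snapshot; later runs are regression tests.
    (A hypothetical sketch of the semi-automatic idea described above.)"""
    current = sorted(map(list, rows))
    if not os.path.exists(snapshot_file):
        with open(snapshot_file, "w") as f:
            json.dump(current, f)
        return "snapshot recorded"
    with open(snapshot_file) as f:
        expected = json.load(f)
    return "ok" if current == expected else "REGRESSION: output differs"

snap = os.path.join(tempfile.mkdtemp(), "dw_snapshot.json")
print(check_etl_result([(1, "a"), (2, "b")], snap))  # snapshot recorded
print(check_etl_result([(1, "a"), (2, "b")], snap))  # ok
print(check_etl_result([(1, "a"), (2, "x")], snap))  # REGRESSION: output differs
```

The "semi-automatic" part is that the human only confirms the first snapshot is correct; all later comparisons run without manual work.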

Traditionally, DWs have been bulk loaded with new data at regular time
intervals, e.g., monthly, weekly, or daily. A new trend, however, is to add
new data as soon as it becomes available from, e.g., a web log or another
online resource. This is done by means of SQL INSERT statements, but these are
slow compared to bulk-loading techniques, and the performance of the database
system drops. Therefore the thesis presents a new and innovative method that
combines the best of both worlds: data can be made available in the DW exactly
when needed, and the user gets bulk-load speed but INSERT-like data
availability.
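The gist of combining the two worlds can be sketched roughly as follows (a simplified, hypothetical illustration, not the actual method from the thesis): rows are buffered cheaply as the producer emits them, and are flushed to the DW in one bulk operation only when a consumer actually needs the data.

```python
import sqlite3

class BufferedLoader:
    """Buffer inserted rows and load them in bulk only when data is needed.
    (A hypothetical sketch of getting bulk-load speed with INSERT-like
    availability, loosely inspired by the idea described above.)"""

    def __init__(self, con, table, ncols):
        self.con = con
        self.sql = "INSERT INTO %s VALUES (%s)" % (table, ", ".join("?" * ncols))
        self.buffer = []

    def insert(self, row):
        # Cheap: just remember the row; no round trip to the database yet.
        self.buffer.append(row)

    def ensure_available(self):
        # Flush everything in one bulk operation right before data is read.
        self.con.executemany(self.sql, self.buffer)
        self.buffer.clear()

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact (page TEXT, errors INTEGER)")
loader = BufferedLoader(con, "fact", 2)
loader.insert(("page1", 3))
loader.insert(("page2", 0))
loader.ensure_available()  # the bulk load happens here, just in time for the query
count = con.execute("SELECT COUNT(*) FROM fact").fetchone()[0]
print(count)  # 2
```

Each individual "insert" thus stays as cheap as appending to a list, while the database only ever sees efficient bulk operations.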

Details

Original language: English
Publisher: Aalborg Universitet
Number of pages: 164
Status: Published - 2008
Series: Ph.D. Thesis
Number: 42
ISSN: 1601-0590
