Aspects of Data Warehouse Technologies for Complex Web Data

Publication: Research - Ph.D. thesis

Abstract

This thesis is about aspects of the specification and development of data
warehouse technologies for complex web data. Today, large amounts of data
exist in different web resources and in different formats, but it is often
hard to analyze and query this big, complex data or data about the
data (i.e., metadata). It is therefore interesting to apply Data Warehouse
(DW) technology to the data. Applying DW technology to complex web data is,
however, not straightforward, and the DW community faces new and exciting
challenges. This thesis considers some of these challenges.


The work leading to this thesis has primarily been done in relation to the
project European Internet Accessibility Observatory (EIAO), where a data
warehouse for accessibility data (roughly, data about how usable web resources
are for disabled users) has been specified and developed. The results of the
thesis can, however, also be applied to other projects using business
intelligence (BI) and/or complex web data. An interesting perspective is that
all the technologies used and developed in the presented work are based on
open source software.

The thesis presents several tools in a survey of the possibilities for using
open source software for BI purposes. Each category of products is evaluated
against criteria relevant to the use of BI in industry. After this,
experiences from designing and implementing a DW for accessibility data are
presented, together with the conceptual, logical, and physical models for the
DW. This is believed to be the first time a general and scalable DW has been
built for the accessibility field, which is complex both to model and to
calculate aggregation results for.

The thesis then presents solutions to generally interesting problems
found during the work on developing a DW and supporting DW technologies for
the EIAO project. A new and efficient way to store triples from an OWL
ontology, known from the Semantic Web field, is presented. In contrast to
traditional triple stores, where the data is stored in a few big tables with
few columns, the presented solution spreads the data over more tables that may
have many columns. This makes it efficient to insert and extract data,
particularly when bulk loading large amounts of data.
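The contrast can be sketched with a small, hypothetical example (the table and column names are invented for illustration and are not the thesis's actual schema): a traditional triple store keeps one narrow table of (subject, predicate, object) rows, whereas a layout with a table per ontology class and a column per property turns each subject into a single wide row.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Traditional layout: one big, narrow table holding every triple.
con.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
con.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [("page1", "url", "http://example.org"), ("page1", "httpStatus", "200")],
)

# Alternative layout (hypothetical): one table per ontology class,
# one column per property, so each subject becomes one wide row.
con.execute(
    "CREATE TABLE WebPage (subject TEXT PRIMARY KEY, url TEXT, httpStatus INTEGER)"
)
con.execute("INSERT INTO WebPage VALUES (?, ?, ?)", ("page1", "http://example.org", 200))

# Extracting all properties of a subject is now a single-row lookup
# instead of a multi-row scan (or self-joins) over the triple table.
row = con.execute("SELECT url, httpStatus FROM WebPage WHERE subject = 'page1'").fetchone()
print(row)  # ('http://example.org', 200)
```

The wide-row layout also suits bulk loading, since all property values of a subject arrive in one row rather than being scattered over many triple rows.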

A new and flexible way to exchange relational data via the XML format (which
is, e.g., used by web services) is also presented. This method saves the labor
of programming the often complex solutions needed to handle correct exchange
of data. With the presented method, the user only has to specify what data to
export and the structure of the generated XML. The data can then automatically
be exported to XML and imported into another database, just as updates to the
XML can automatically be migrated back to the original database.
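The export direction of this idea can be illustrated with a simplified, hypothetical sketch (the function and its parameters are invented for illustration, not the thesis's actual interface): the user supplies only a query and the desired XML element names, and the tool shapes the result set accordingly.

```python
import sqlite3
import xml.etree.ElementTree as ET

def export_to_xml(con, query, root_tag, row_tag):
    """Export a query result as XML; column names become element names.
    (A simplified, hypothetical stand-in for the method described above.)"""
    cur = con.execute(query)
    cols = [d[0] for d in cur.description]
    root = ET.Element(root_tag)
    for row in cur:
        elem = ET.SubElement(root, row_tag)
        for col, val in zip(cols, row):
            ET.SubElement(elem, col).text = str(val)
    return ET.tostring(root, encoding="unicode")

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE site (id INTEGER, url TEXT)")
con.execute("INSERT INTO site VALUES (1, 'http://example.org')")
xml = export_to_xml(con, "SELECT id, url FROM site", "sites", "site")
print(xml)  # <sites><site><id>1</id><url>http://example.org</url></site></sites>
```

The import direction would invert the same specification: walk the XML, map element names back to columns, and insert the rows into the target database.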

Regression testing is widely accepted and used in traditional software
development. For Extract-Transform-Load (ETL) software, however, regression
testing is traditionally cumbersome and time-consuming. The thesis points out
crucial differences between testing "normal" software and testing ETL
software, and against that background a new semi-automatic framework for
regression testing of ETL software is introduced. The framework makes it easy
and fast to start doing regression testing; setting up a regression test with
the framework only takes minutes.
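One core idea behind such semi-automatic regression testing can be sketched as follows (a simplified illustration, not the thesis's actual framework): run the ETL flow on a fixed source, record the resulting DW state as a snapshot the first time, and on later runs compare the new state against the stored snapshot.

```python
import json
import os
import tempfile

def check_etl_result(rows, snapshot_file):
    """Compare an ETL run's output rows to a stored 'known good' snapshot.
    The first run records the snapshot; later runs are regression tests.
    (A hypothetical sketch of the semi-automatic idea described above.)"""
    current = sorted(map(list, rows))
    if not os.path.exists(snapshot_file):
        with open(snapshot_file, "w") as f:
            json.dump(current, f)
        return "snapshot recorded"
    with open(snapshot_file) as f:
        expected = json.load(f)
    return "ok" if current == expected else "REGRESSION: output differs"

snap = os.path.join(tempfile.mkdtemp(), "dw_snapshot.json")
print(check_etl_result([(1, "a"), (2, "b")], snap))  # snapshot recorded
print(check_etl_result([(1, "a"), (2, "b")], snap))  # ok
print(check_etl_result([(1, "a"), (2, "x")], snap))  # REGRESSION: output differs
```

The "semi-automatic" part is that the human only confirms the first snapshot is correct; all later comparisons run without manual work.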

Traditionally, DWs have been bulk loaded with new data at regular time
intervals, e.g., monthly, weekly, or daily. A new trend, however, is to add
new data as soon as it becomes available from, e.g., a web log or another
online resource. This is done by means of SQL INSERT statements, but these are
slow compared to bulk-loading techniques, and the performance of the database
system drops. Therefore the thesis presents a new and innovative method that
combines the best of both worlds: data can be made available in the DW exactly
when needed, and the user gets bulk-load speed but INSERT-like data
availability.
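The gist of combining the two worlds can be sketched roughly as follows (a simplified, hypothetical illustration, not the actual method from the thesis): rows are buffered cheaply as the producer emits them, and are flushed to the DW in one bulk operation only when a consumer actually needs the data.

```python
import sqlite3

class BufferedLoader:
    """Buffer inserted rows and load them in bulk only when data is needed.
    (A hypothetical sketch of getting bulk-load speed with INSERT-like
    availability, loosely inspired by the idea described above.)"""

    def __init__(self, con, table, ncols):
        self.con = con
        self.sql = "INSERT INTO %s VALUES (%s)" % (table, ", ".join("?" * ncols))
        self.buffer = []

    def insert(self, row):
        # Cheap: just remember the row; no round trip to the database yet.
        self.buffer.append(row)

    def ensure_available(self):
        # Flush everything in one bulk operation right before data is read.
        self.con.executemany(self.sql, self.buffer)
        self.buffer.clear()

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact (page TEXT, errors INTEGER)")
loader = BufferedLoader(con, "fact", 2)
loader.insert(("page1", 3))
loader.insert(("page2", 0))
loader.ensure_available()  # the bulk load happens here, just in time for the query
count = con.execute("SELECT COUNT(*) FROM fact").fetchone()[0]
print(count)  # 2
```

Each individual "insert" thus stays as cheap as appending to a list, while the database only ever sees efficient bulk operations.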

Details

Original language: English
Publisher: Aalborg Universitet
Number of pages: 164
Status: Published - 2008
Series: Ph.D. Thesis
Number: 42
ISSN: 1601-0590
