Standard

Aspects of Data Warehouse Technologies for Complex Web Data. / Thomsen, Christian.

Aalborg Universitet, 2008. 164 p. (Ph.D. Thesis; No. 42).

Publication: Research, Ph.D. thesis

Harvard

Thomsen, C 2008, Aspects of Data Warehouse Technologies for Complex Web Data. Ph.D. thesis, Aalborg Universitet. Ph.D. Thesis, no. 42.

APA

Thomsen, C. (2008). Aspects of Data Warehouse Technologies for Complex Web Data. Aalborg Universitet. (Ph.D. Thesis; No. 42).

CBE

Thomsen C 2008. Aspects of Data Warehouse Technologies for Complex Web Data. Aalborg Universitet. 164 p. (Ph.D. Thesis; No. 42).

MLA

Thomsen, Christian. Aspects of Data Warehouse Technologies for Complex Web Data. Aalborg Universitet, 2008. (Ph.D. Thesis; No. 42).

Vancouver

Thomsen C. Aspects of Data Warehouse Technologies for Complex Web Data. Aalborg Universitet, 2008. 164 p. (Ph.D. Thesis; No. 42).

Author

Thomsen, Christian / Aspects of Data Warehouse Technologies for Complex Web Data.

Aalborg Universitet, 2008. 164 p. (Ph.D. Thesis; No. 42).

Publication: Research, Ph.D. thesis

Bibtex

@book{f21004d0378f11dd8f6f000ea68e967b,
title = "Aspects of Data Warehouse Technologies for Complex Web Data",
publisher = "Aalborg Universitet",
author = "Christian Thomsen",
year = "2008",
series = "Ph.D. Thesis",
number = "42",
}

RIS

TY - BOOK

T1 - Aspects of Data Warehouse Technologies for Complex Web Data

A1 - Thomsen, Christian

AU - Thomsen, Christian

PB - Aalborg Universitet

PY - 2008

Y1 - 2008

N2 - This thesis is about aspects of the specification and development of data warehouse technologies for complex web data. Today, large amounts of data exist in different web resources and in different formats. But it is often hard to analyze and query the often big and complex data or data about the data (i.e., metadata). It is therefore interesting to apply Data Warehouse (DW) technology to the data. But applying DW technology to complex web data is not straightforward, and the DW community faces new and exciting challenges. This thesis considers some of these challenges.
The work leading to this thesis has primarily been done in relation to the project European Internet Accessibility Observatory (EIAO), where a data warehouse for accessibility data (roughly, data about how usable web resources are for disabled users) has been specified and developed. But the results of the thesis can also be applied to other projects using business intelligence (BI) and/or complex web data. An interesting perspective is that all the technologies used and developed in the presented work are based on open source software.
The thesis presents several tools in a survey of the possibilities for using open source software for BI purposes. Each category of products is evaluated against criteria relevant to the use of BI in industry. After this, experiences from designing and implementing a DW for accessibility data are presented. Further, the conceptual, logical, and physical models for the DW are presented. This is believed to be the first time a general and scalable DW has been built for the accessibility field, which is complex both to model and to calculate aggregation results for.
The thesis then presents solutions to interesting general problems found during the work on developing a DW and supporting DW technologies for the EIAO project. A new and efficient way to store triples from an OWL ontology, known from the Semantic Web field, is presented. In contrast to traditional triple stores, where the data is stored in few, but big, tables with few columns, the presented solution spreads the data over more tables that may have many columns. This makes it efficient to insert and extract data, in particular when bulk loading large amounts of data.
A new and flexible way to exchange relational data via the XML format (which is, e.g., used by web services) is also presented. This method saves the labor of programming often complex solutions to handle correct exchange of data. With the presented method, the user only has to specify what data to export and the structure of the generated XML. The data can then automatically be exported to XML and imported into another database, just as updates to the XML can automatically be migrated back to the original database.
Regression testing is widely accepted and used in traditional software development. For Extract-Transform-Load (ETL) software, regression testing is, however, traditionally cumbersome and time-consuming. The thesis points out crucial differences between testing "normal" software and ETL software, and on that background a new semi-automatic framework for regression testing of ETL software is introduced. The framework makes it easy and fast to start doing regression testing: it only takes minutes to set up regression tests with the framework.
Traditionally, DWs have been bulk loaded with new data at regular time intervals, e.g., monthly, weekly, or daily. But a new trend is to add new data as soon as it becomes available from, e.g., a web log or another online resource. This is done by means of SQL INSERT statements, but these are slow compared to bulk loading techniques, and the performance of the database system drops. Therefore the thesis presents a new and innovative method that combines the best of these worlds. Data can be made available in the DW exactly when needed, and the user gets bulk-load speeds but INSERT-like data availability.

BT - Aspects of Data Warehouse Technologies for Complex Web Data

T3 - Ph.D. Thesis

T3 - en_GB

ER -