Aspects of Data Warehouse Technologies for Complex Web Data
Publication: Research › PhD thesis
Standard
Aspects of Data Warehouse Technologies for Complex Web Data. / Thomsen, Christian.
Aalborg Universitet, 2008. 164 p. (Ph.D. Thesis; No. 42).
RIS
TY - BOOK
T1 - Aspects of Data Warehouse Technologies for Complex Web Data
A1 - Thomsen, Christian
AU - Thomsen, Christian
PB - Aalborg Universitet
PY - 2008
Y1 - 2008
N2 - This thesis is about aspects of the specification and development of data warehouse technologies for complex web data. Today, large amounts of data exist in different web resources and in different formats. But it is often hard to analyze and query this often big and complex data, or data about the data (i.e., metadata). It is therefore interesting to apply Data Warehouse (DW) technology to the data. But applying DW technology to complex web data is not straightforward, and the DW community faces new and exciting challenges. This thesis considers some of these challenges.
The work leading to this thesis has primarily been done in relation to the European Internet Accessibility Observatory (EIAO) project, where a data warehouse for accessibility data (roughly, data about how usable web resources are for disabled users) has been specified and developed. But the results of the thesis can also be applied to other projects using business intelligence (BI) and/or complex web data. An interesting perspective is that all the technologies used and developed in the presented work are based on open source software.
The thesis presents several tools in a survey of the possibilities for using open source software for BI purposes. Each category of products is evaluated against criteria relevant to the use of BI in industry. After this, experiences from designing and implementing a DW for accessibility data are presented. Further, the conceptual, logical, and physical models for the DW are presented. This is believed to be the first time a general and scalable DW has been built for the accessibility field, which is complex both to model and to calculate aggregation results for.
The thesis then presents solutions to interesting general problems found during the work on developing a DW and supporting DW technologies for the EIAO project. A new and efficient way to store triples from an OWL ontology, known from the Semantic Web field, is presented. In contrast to traditional triple stores, where the data is stored in few, but big, tables with few columns, the presented solution spreads the data over more tables that may have many columns. This makes it efficient to insert and extract data, in particular when bulk loading large amounts of data.
A new and flexible way to exchange relational data via the XML format (which is, e.g., used by web services) is also presented. This method saves the labor of programming often complex solutions to handle correct exchange of data. With the presented method, the user only has to specify what data to export and the structure of the generated XML. The data can then automatically be exported to XML and imported into another database, and updates to the XML can likewise automatically be migrated back to the original database.
Regression testing is widely accepted and used in traditional software development. For Extract-Transform-Load (ETL) software, regression testing is, however, traditionally cumbersome and time-consuming. The thesis points out crucial differences between testing "normal" software and ETL software, and on that background a new semi-automatic framework for regression testing of ETL software is introduced. The framework makes it easy and fast to start doing regression testing: setting it up takes only minutes.
Traditionally, DWs have been bulk loaded with new data at regular time intervals, e.g., monthly, weekly, or daily. But a new trend is to add new data as soon as it becomes available from, e.g., a web log or another online resource. This is done by means of SQL INSERT statements, but these are slow compared to bulk loading techniques, and the performance of the database system drops. Therefore, the thesis presents a new and innovative method that combines the best of these worlds. Data can be made available in the DW exactly when needed, and the user gets bulk-load speeds but INSERT-like data availability.
BT - Aspects of Data Warehouse Technologies for Complex Web Data
T3 - Ph.D. Thesis
T3 - en_GB
ER -