## Abstract

The goal of this thesis is to design, develop, and evaluate new methods for the approximate matching of hierarchical data represented as labeled trees. In approximate matching scenarios two items should be matched if they are similar. Computing the similarity between labeled trees is hard as in addition to the data values also the structure must be considered. A well-known measure for comparing trees is the tree edit distance. It is computationally expensive and leads to a prohibitively high run time.

Our solution for the approximate matching of hierarchical data are pq-grams. The pq-grams of a tree are all its subtrees of a particular shape. Intuitively, two trees are similar if they have many pq-grams in common. The pq-gram distance is an efficient and effective approximation of the tree edit distance. We analyze the properties of the pq-gram distance and compare it with the tree edit distance and alternative approximations. The pq-grams are stored in the pq-gram index which is implemented as a relation and represents a pq-gram by a fixed-length string. The pq-gram index offers efficient approximate lookups and reduces the approximate pq-gram join to an equality join on strings. We formally proof that the pq-gram index can be incrementally updated based on the log of edit operations without reconstructing intermediate tree versions. The incremental update is independent of the data size and scales to a large number of changes in the data. We introduce windowed pq-grams for the approximate matching of unordered trees. Windowed pq-grams are generated from sorted trees in linear time. Sorting trees is not possible for common ordered tree distances such as the edit distance.

We present the address connector, a new system for synchronizing residential addresses that implements a pq-gram based distance between streets, introduces a global greedy matching that guarantees stable pairs, and links addresses that are stored with different granularity. The connector has been successfully tested with public administration databases. Our extensive experiments on both synthetic and real world data confirm the analytic results and suggest that pq-grams are both useful and scalable.

Our solution for the approximate matching of hierarchical data are pq-grams. The pq-grams of a tree are all its subtrees of a particular shape. Intuitively, two trees are similar if they have many pq-grams in common. The pq-gram distance is an efficient and effective approximation of the tree edit distance. We analyze the properties of the pq-gram distance and compare it with the tree edit distance and alternative approximations. The pq-grams are stored in the pq-gram index which is implemented as a relation and represents a pq-gram by a fixed-length string. The pq-gram index offers efficient approximate lookups and reduces the approximate pq-gram join to an equality join on strings. We formally proof that the pq-gram index can be incrementally updated based on the log of edit operations without reconstructing intermediate tree versions. The incremental update is independent of the data size and scales to a large number of changes in the data. We introduce windowed pq-grams for the approximate matching of unordered trees. Windowed pq-grams are generated from sorted trees in linear time. Sorting trees is not possible for common ordered tree distances such as the edit distance.

We present the address connector, a new system for synchronizing residential addresses that implements a pq-gram based distance between streets, introduces a global greedy matching that guarantees stable pairs, and links addresses that are stored with different granularity. The connector has been successfully tested with public administration databases. Our extensive experiments on both synthetic and real world data confirm the analytic results and suggest that pq-grams are both useful and scalable.

Original language | English |
---|

Publisher | Aalborg Universitet |
---|---|

Number of pages | 125 |

Publication status | Published - 2008 |

Series | Ph.D. thesis |
---|---|

Number | 43 |

ISSN | 1601-0590 |