Annotated text databases in the context of the Kaj Munk corpus: One database model, one query language, and several applications

Research output: PhD thesis

Abstract

The central theme of this PhD dissertation is “annotated text databases”. An annotated
text database is a collection of text plus information about that text, stored in a computer system for easy update and access. The “information about the text” constitutes the
annotations of the text.

My PhD work has been carried out under the organizational umbrella of the Kaj Munk
Research Centre at Aalborg University, Denmark. Kaj Munk (1898–1944) was an influential and prolific playwright, journalist, pastor, and poet, whose influence was widely
felt — both inside and outside of Denmark — during the period between World War I
and World War II. He was murdered by the Gestapo in early January 1944 for his resistance
stance.

The two main tasks of the Kaj Munk Research Centre in which I have been involved
during my PhD work are: a) Digitizing the nachlass of Kaj Munk, and b) Making the
texts of Kaj Munk available electronically to the general public. My dissertation reflects
these tasks by taking the works of Kaj Munk as the empirical basis, the empirical sample
data, on which to test the theoretical advancements made in my dissertation.

My work has thus not been about Kaj Munk or his works as seen from a historical or
even literary perspective. My perspective on Kaj Munk’s works has been that of a computer scientist seeking to represent annotated versions of Kaj Munk’s works in a computer
database system, and supporting easy querying of these annotated texts. As such, the fact
that the empirical basis has been Kaj Munk’s works is largely immaterial; the principles
crystallized, the methods obtained, and the system implemented could equally well have been brought to bear on any other collection of annotated text. Future research might see
such endeavors.

The theoretical advancements which I have made during my PhD build on the work
of a number of other people, the primary point of departure being the work of Dr. Crist-Jan
Doedens in his PhD dissertation from 1994: “Text Databases — One Database Model and
Several Retrieval Languages”, University of Utrecht, the Netherlands. Dr. Doedens, in his
PhD dissertation, described the “Monads dot Features” (or MdF) model of annotated text,
as well as the “QL query language” defined over MdF databases.


In my work, I have taken the MdF text database model, reduced its scope in some
areas, and extended it in others, thus arriving at the EMdF
model. I have also taken Doedens’s QL query language, reduced part of it to a
slightly smaller sub-language, and extended it in numerous ways, thus arriving
at the “MQL query language”.


The EMdF model is able to express almost any annotation necessary for representing
linguistic annotations of text. As I show in Chapter 10, it is certainly the case that all of
the annotations with which we in the Kaj Munk Research Centre have desired to enrich
the Kaj Munk Corpus can be expressed in the EMdF model.

The MQL query language is a “full access language”, supporting the four operations
“create”, “retrieve”, “update”, and “delete” on all of the data domains in the EMdF model.
As such, it goes beyond Doedens’s QL, which was only a “retrieval” language.

I have implemented the EMdF model and the MQL query language in the “Emdros”
corpus query system. In so doing, I have fulfilled most of the demands which Doedens
placed on an annotated text database system, as I show in Chapters 4, 5, and 14.
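
To make the idea of a “full access” annotated text database more concrete, the following Python sketch models annotations as typed objects that cover a set of integer text positions and carry named features, with the four operations create, retrieve, update, and delete. It is a toy illustration only: the class names, method names, feature names, and the two-word sample text are all hypothetical, and the code reproduces neither MQL syntax nor the Emdros API.

```python
from dataclasses import dataclass, field


@dataclass
class AnnotatedObject:
    """One annotation: a typed object covering a set of text positions."""
    object_id: int
    object_type: str      # illustrative type names, e.g. "word" or "sentence"
    monads: frozenset     # integer positions in the text covered by the object
    features: dict = field(default_factory=dict)


class ToyTextDatabase:
    """Hypothetical in-memory store; an assumption, not the Emdros API."""

    def __init__(self):
        self._objects = {}
        self._next_id = 1

    def create(self, object_type, monads, **features):
        """Store a new annotation and return its id."""
        obj = AnnotatedObject(self._next_id, object_type,
                              frozenset(monads), dict(features))
        self._objects[obj.object_id] = obj
        self._next_id += 1
        return obj.object_id

    def retrieve(self, object_type, **wanted):
        """Return objects of the given type whose features match `wanted`."""
        return [
            o for o in self._objects.values()
            if o.object_type == object_type
            and all(o.features.get(k) == v for k, v in wanted.items())
        ]

    def update(self, object_id, **features):
        """Add or overwrite features on an existing annotation."""
        self._objects[object_id].features.update(features)

    def delete(self, object_id):
        """Remove an annotation from the store."""
        del self._objects[object_id]


# Invented sample data: two words and one sentence spanning both.
db = ToyTextDatabase()
w1 = db.create("word", {1}, surface="the", part_of_speech="article")
w2 = db.create("word", {2}, surface="word", part_of_speech="noun")
s1 = db.create("sentence", {1, 2})

db.update(w2, lemma="word")                          # update
nouns = db.retrieve("word", part_of_speech="noun")   # retrieve
db.delete(s1)                                        # delete
remaining_sentences = db.retrieve("sentence")
```

A retrieval-only language would correspond to exposing just `retrieve` here; the other three methods are what “full access” adds in this sketch.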

The dissertation is structured as follows.

Part I on “Theoretical Foundations” encompasses Chapters 1 to 8.


Chapter 1 introduces the topics of the thesis, and provides definitions of some of the
main terms used in the dissertation. Chapter 2 provides a literature review. Chapter 3
provides a brief introduction to the topic of “ontology”, for use in later chapters. Chapter
4 introduces and discusses the EMdF model, while Chapter 5 does the same for the MQL
query language. Chapter 6 introduces and discusses the Sheaf, which is one kind of output
from an MQL query. Chapter 7 describes a general algorithm for, and a classification
of possible strategies for, “harvesting” a Sheaf, that is, turning a Sheaf into meaningful
results. Chapter 8 discusses the relationship between annotated text (as expressible in the
EMdF model) on the one hand, and time on the other.
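
As a rough illustration of what “harvesting” might involve, the Python sketch below walks a nested query result and collects the text positions of every matched object of a chosen type. It assumes that a Sheaf is a list of “straws”, each straw a list of matched objects, and that a matched object may carry an inner Sheaf. That structure, and every class and function name here, is an assumption made for illustration; the sketch is one possible harvesting strategy, not the general algorithm of Chapter 7.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MatchedObject:
    object_type: str
    monads: List[int]                  # text positions covered by the match
    inner: Optional["Sheaf"] = None    # nested results, if any


@dataclass
class Straw:
    matched_objects: List[MatchedObject] = field(default_factory=list)


@dataclass
class Sheaf:
    straws: List[Straw] = field(default_factory=list)


def harvest(sheaf, wanted_type):
    """Depth-first walk collecting the monads of objects of `wanted_type`."""
    results = []
    for straw in sheaf.straws:
        for mo in straw.matched_objects:
            if mo.object_type == wanted_type:
                results.append(sorted(mo.monads))
            if mo.inner is not None:
                results.extend(harvest(mo.inner, wanted_type))
    return results


# Invented sample result: one sentence containing two matched words.
inner = Sheaf([
    Straw([MatchedObject("word", [3])]),
    Straw([MatchedObject("word", [5])]),
])
outer = Sheaf([Straw([MatchedObject("sentence", [1, 2, 3, 4, 5], inner)])])

word_hits = harvest(outer, "word")
sentence_hits = harvest(outer, "sentence")
```

An application could then map the collected positions back to the stored text in order to display each hit in context.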

Part II on “Applications” encompasses Chapters 9 to 13.

Chapter 9 introduces Part II. Chapter 10 describes how the theoretical foundations laid
in Part I can be used to implement the Kaj Munk Corpus in the EMdF model. Chapter 11
discusses the principles behind, and the functionality of, a web-based application which
I have written on top of Emdros in order to support collaborative annotation of the Kaj
Munk Corpus. Chapter 12 is the central application chapter, in which I show how the
EMdF model, the MQL query language, and the harvesting procedure described in Part I
can be brought to bear on the task of making Kaj Munk’s works available electronically
to the general public. I do so by describing how I have implemented a “Munk Browser” desktop application. Chapter 13 discusses ways in which the EMdF model and the MQL
query language can be used to support the process of finding quotations from one corpus
in another corpus, in this case, finding quotations from the Bible in the Kaj Munk Corpus.

Part III on “Perspectives” is very short, encompassing only two chapters.

Chapter 14 discusses ways in which the EMdF model and the MQL query language
can be extended to support the requirements of the problem of storing and retrieving
annotated text even better. Finally, Chapter 15 concludes the dissertation.

Appendix A gives the grammar for the subset of the MQL query language which
closely resembles Doedens’s QL.

Seven already-published, internationally peer-reviewed articles accompany the dissertation in Appendix B, and form part of the basis for evaluation of the dissertation.

Translated title of the contribution: Annoterede tekstdatabaser i kontekst af Kaj Munk Korpusset: Èn databasemodel, ét søgesprog og flere anvendelser
Original language: English
Place of Publication: Aalborg
Publisher
Publication status: Published - May 2008