Annotated text databases in the context of the Kaj Munk corpus: One database model, one query language, and several applications

Ulrik Sandborg-Petersen

Research output: PhD thesis

Abstract

The central theme of this PhD dissertation is “annotated text databases”. An annotated
text database is a collection of text plus information about that text, stored in a computer system for easy update and access. The “information about the text” constitutes the
annotations of the text.

My PhD work has been carried out under the organizational umbrella of the Kaj Munk
Research Centre at Aalborg University, Denmark. Kaj Munk (1898–1944) was an influential and prolific playwright, journalist, pastor, and poet, whose influence was widely
felt — both inside and outside of Denmark — during the period between World War I
and World War II. He was murdered by the Gestapo in early January 1944 for his resistance
stance.

The two main tasks of the Kaj Munk Research Centre in which I have been involved
during my PhD work are: a) Digitizing the nachlass of Kaj Munk, and b) Making the
texts of Kaj Munk available electronically to the general public. My dissertation reflects
these tasks by taking the works of Kaj Munk as the empirical basis, the empirical sample
data, on which to test the theoretical advancements made in my dissertation.

My work has thus not been about Kaj Munk or his works as seen from a historical or
even literary perspective. My perspective on Kaj Munk’s works has been that of a computer scientist seeking to represent annotated versions of Kaj Munk’s works in a computer
database system, and supporting easy querying of these annotated texts. As such, the fact
that the empirical basis has been Kaj Munk’s works is largely immaterial; the principles
crystallized, the methods obtained, and the system implemented could equally well have been brought to bear on any other collection of annotated text. Future research might see
such endeavors.

The theoretical advancements which I have made during my PhD build on the work
of a number of other people, the primary point of departure being the work of Dr. Crist-Jan
Doedens in his PhD dissertation from 1994: “Text Databases — One Database Model and
Several Retrieval Languages”, University of Utrecht, the Netherlands. Dr. Doedens, in his
PhD dissertation, described the “Monads dot Features” (or MdF) model of annotated text,
as well as the “QL query language” defined over MdF databases.

In my work, I have taken the MdF text database model, and have both reduced it in
scope in some areas, and have also extended it in other areas, thus arriving at the EMdF
model. I have also taken Doedens’s QL query language, and have reduced part of it to a
slightly smaller sub-language, but have also extended it in numerous ways, thus arriving
at the “MQL query language”.

The EMdF model is able to express almost any annotation necessary for representing
linguistic annotations of text. As I show in Chapter 10, it is certainly the case that all of
the annotations with which we in the Kaj Munk Research Centre have desired to enrich
the Kaj Munk Corpus can be expressed in the EMdF model.

The MQL query language is a “full access language”, supporting the four operations
“create”, “retrieve”, “update”, and “delete” on all of the data domains in the EMdF model.
As such, it goes beyond Doedens’s QL, which was only a “retrieval” language.

I have implemented the EMdF model and the MQL query language in the “Emdros”
corpus query system. In so doing, I have fulfilled most of the demands which Doedens
placed on an annotated text database system, as I show in Chapters 4, 5, and 14.
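
To make the idea of a “full access” annotated text database concrete, the following Python sketch stores each annotation as a typed object covering a set of integer text positions (monads) and carrying named features, and supports the four operations create, retrieve, update, and delete. It is a toy in-memory model under stated assumptions, not the Emdros implementation and not MQL syntax; the names `ToyTextDB`, `TextObject`, and all method names are hypothetical, and the sample words are invented.

```python
from dataclasses import dataclass, field


@dataclass
class TextObject:
    """An annotation: a typed object covering a set of monads (text positions)."""
    object_type: str
    monads: frozenset
    features: dict = field(default_factory=dict)


class ToyTextDB:
    """Hypothetical in-memory store for illustration; not the Emdros API."""

    def __init__(self):
        self._objects = {}
        self._next_id = 1

    def create(self, object_type, monads, **features):
        """Store a new object and return its id."""
        oid = self._next_id
        self._next_id += 1
        self._objects[oid] = TextObject(object_type, frozenset(monads), dict(features))
        return oid

    def retrieve(self, object_type, **wanted):
        """Return sorted ids of objects of this type whose features match `wanted`."""
        return sorted(
            oid for oid, obj in self._objects.items()
            if obj.object_type == object_type
            and all(obj.features.get(k) == v for k, v in wanted.items())
        )

    def update(self, oid, **features):
        """Add or overwrite features on an existing object."""
        self._objects[oid].features.update(features)

    def delete(self, oid):
        """Remove an object from the store."""
        del self._objects[oid]


db = ToyTextDB()
w1 = db.create("word", {1}, surface="ordet", pos="noun")   # invented sample data
w2 = db.create("word", {2}, surface="lever", pos="verb")
s1 = db.create("sentence", {1, 2})
db.update(w2, lemma="leve")
nouns = db.retrieve("word", pos="noun")
db.delete(s1)
```

A retrieval-only language would correspond to exposing just `retrieve`; the other three methods are what the “full access” label adds in this sketch.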

The dissertation is structured as follows.

Part I on “Theoretical Foundations” encompasses Chapters 1 to 8.

Chapter 1 introduces the topics of the thesis, and provides definitions of some of the
main terms used in the dissertation. Chapter 2 provides a literature review. Chapter 3
provides a brief introduction to the topic of “ontology”, for use in later chapters. Chapter
4 introduces and discusses the EMdF model, while Chapter 5 does the same for the MQL
query language. Chapter 6 introduces and discusses the Sheaf, which is one kind of output
from an MQL query. Chapter 7 describes a general algorithm for, and a classification
of possible strategies for, “harvesting” a Sheaf, that is, turning a Sheaf into meaningful
results. Chapter 8 discusses the relationship between annotated text (as expressible in the
EMdF model) on the one hand, and time on the other.
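
As a rough illustration of what “harvesting” a query result can mean, the Python sketch below assumes that a sheaf is a nested structure: a list of alternative matches (“straws”), each a list of matched objects, each of which may contain an inner sheaf. That structure, the class `MatchedObject`, and the function `harvest` are assumptions made for this example; the abstract itself does not specify the Sheaf's layout or the harvesting algorithm of Chapter 7. The sketch shows only one simple strategy: a depth-first walk that collects the monad spans of one object type.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MatchedObject:
    """One matched object; `inner` is a nested sheaf (hypothetical layout)."""
    object_type: str
    first_monad: int
    last_monad: int
    inner: List[List["MatchedObject"]] = field(default_factory=list)


def harvest(sheaf, wanted_type):
    """Depth-first walk collecting (first, last) monad spans of one object type."""
    hits = []
    for straw in sheaf:
        for mo in straw:
            if mo.object_type == wanted_type:
                hits.append((mo.first_monad, mo.last_monad))
            # Recurse into whatever was matched inside this object.
            hits.extend(harvest(mo.inner, wanted_type))
    return hits


# Invented sample result: two sentences, each containing matched words.
sheaf = [
    [MatchedObject("sentence", 1, 4, inner=[[MatchedObject("word", 2, 2)]])],
    [MatchedObject("sentence", 5, 9, inner=[[MatchedObject("word", 5, 5),
                                             MatchedObject("word", 8, 8)]])],
]
word_spans = harvest(sheaf, "word")
```

An application could then use the collected spans to decide which stretches of text to display or highlight.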

Part II on “Applications” encompasses Chapters 9 to 13.

Chapter 9 introduces Part II. Chapter 10 describes how the theoretical foundations laid
in Part I can be used to implement the Kaj Munk Corpus in the EMdF model. Chapter 11
discusses the principles behind, and the functionality of, a web-based application which
I have written on top of Emdros in order to support collaborative annotation of the Kaj
Munk Corpus. Chapter 12 is the central application chapter, in which I show how the
EMdF model, the MQL query language, and the harvesting procedure described in Part I
can all be brought to bear on the task of making Kaj Munk’s works available electronically
to the general public. I do so by describing how I have implemented a “Munk Browser” desktop application. Chapter 13 discusses ways in which the EMdF model and the MQL
query language can be used to support the process of finding quotations from one corpus
in another corpus, in this case, finding quotations from the Bible in the Kaj Munk Corpus.
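
One generic way to look for quotations from one text in another is to compare overlapping word sequences (n-grams). The Python sketch below shows that technique only as an illustration; it is not the method of Chapter 13, which the abstract does not describe. The function names `ngrams` and `shared_ngrams` are hypothetical, and the target sentence is invented rather than taken from the Kaj Munk Corpus.

```python
def ngrams(tokens, n):
    """Return the set of all n-token windows in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def shared_ngrams(source_text, target_text, n=3):
    """Return the sorted word n-grams that occur in both texts (case-insensitive)."""
    src = ngrams(source_text.lower().split(), n)
    tgt = ngrams(target_text.lower().split(), n)
    return sorted(src & tgt)


source = "In the beginning was the Word"
target = "He said that in the beginning was silence"  # invented example sentence
candidates = shared_ngrams(source, target, n=3)
```

Each shared n-gram is only a candidate: short or very common sequences will match by chance, so a real system would need further filtering.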

Part III on “Perspectives” is very short, encompassing only two chapters.

Chapter 14 discusses ways in which the EMdF model and the MQL query language
can be extended to support the requirements of the problem of storing and retrieving
annotated text even better. Finally, Chapter 15 concludes the dissertation.

Appendix A gives the grammar for the subset of the MQL query language which
closely resembles Doedens’s QL.

Seven already-published, internationally peer-reviewed articles accompany the dissertation in Appendix B, and form part of the basis for evaluation of the dissertation.

Translated title of the contribution	Annoterede tekstdatabaser i kontekst af Kaj Munk Korpusset: Én databasemodel, ét søgesprog og flere anvendelser
Original language	English
Place of Publication	Aalborg
Publisher	InDiMedia, Department of Communication, Aalborg University
Publication status	Published - May 2008
