Automatic analysis of dialect literature: advantages and challenges

Kevin Watson; Marie Møller Jensen

Publication: Contribution to book/anthology/report/conference proceeding › Contribution to book/anthology › Research › peer-reviewed

Abstract

Although the type of orthographic variation found in dialect literature has often been dismissed as somewhat unwieldy and haphazard, recent work has argued that a systematic analysis of such spelling can reveal details about the sociolinguistic and phonological salience of the features being represented (Honeybone & Watson 2013). To do this, respellings are treated as sociolinguistic variables and quantified. As with analysis of sociolinguistic variables in speech, this requires the identification of the relevant variable contexts, and (for categorical variables) the labelling of particular variants. In speech, great advances have been made in the use of automatic methods for identifying relevant contexts (e.g. LaBB-CAT, Fromont & Hay 2012), but in the analysis of spelling variants in dialect literature, manual identification and labelling are still the norm. This increases the time needed to carry out the analysis, and limits the size of the datasets that have been considered in the field so far. In this chapter we assess the utility of two corpus linguistic tools for the automatic analysis of dialect literature – particularly contemporary, humorous, localised dialect literature (CHLDL). The first is VARD, a tool designed to standardise spellings in historical English texts (Baron & Rayson 2008). Using techniques from modern spell checkers alongside training algorithms, VARD standardises texts with non-standard spellings, adding a standard spelling layer to a non-standardly spelled text. The result is that each non-standardly spelled word has an associated equivalent in standard spelling. This set of correspondences is the input to the second tool, DICER (Discovery and Investigation of Character Edit Rules; Baron et al. 2009). DICER identifies and quantifies the differences between the standard and non-standard spellings, providing information about, for example, which characters are changed and how frequently the changes occur.
Taken together, these tools offer an opportunity to upscale the data used for the analysis of dialect literature. To evaluate these tools, we use two CHLDL texts: one from Liverpool English, and one from Newcastle English. For each of these texts, we assess the accuracy of the automatic coding in VARD, and test the amount of training data needed for accurate results. We then assess the quantification in DICER, and compare it to what is already known about the respellings of these texts (e.g. Honeybone & Watson 2013, Beal 2000, Jensen 2013). Finally, we reflect on the advantages and likely future challenges of this kind of analysis in the investigation of dialect literature.
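
The two-step idea described above (pair each respelling with a standard form, then quantify the character-level differences) can be illustrated with a small sketch. It is not VARD or DICER and does not reproduce their algorithms: it simply aligns each pair with Python's standard difflib module and counts the resulting edits. The function names edit_rules and count_rules are hypothetical, and the word pairs are invented toy examples, not data from the chapter.

```python
from collections import Counter
from difflib import SequenceMatcher


def edit_rules(nonstandard, standard):
    """Return (standard_chars, nonstandard_chars) for each span where the spellings differ."""
    # Align the standard spelling against the respelling, character by character.
    matcher = SequenceMatcher(None, standard, nonstandard, autojunk=False)
    rules = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # An empty string on either side marks an insertion or a deletion.
            rules.append((standard[i1:i2], nonstandard[j1:j2]))
    return rules


def count_rules(pairs):
    """Count how often each character edit occurs across all (nonstandard, standard) pairs."""
    counts = Counter()
    for nonstandard, standard in pairs:
        counts.update(edit_rules(nonstandard, standard))
    return counts


# Invented (respelling, standard spelling) pairs, for illustration only.
pairs = [("dat", "that"), ("dis", "this"), ("nuthin", "nothing"), ("goin", "going")]

rule_counts = count_rules(pairs)
for (std, non), n in rule_counts.most_common():
    print(f"{std!r} -> {non!r}: {n}")
```

In this toy input the edit from "th" to "d" and the deletion of final "g" each occur twice, which is the kind of frequency information that lets a respelling be treated as a quantifiable variable. A real analysis would need a more careful alignment than difflib provides.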

Original language	English
Title of host publication	Dialect Writing and the North of England
Editors	Patrick Honeybone, Warren Maguire
Publisher	Edinburgh University Press
Publication date	Sep 2020
Chapter	14
ISBN (Print)	9781474442565
Status	Published - Sep 2020

Other files and links

https://edinburghuniversitypress.com/book-dialect-writing-and-the-north-of-england.html

Citation formats

@inbook{a84c0de5c7c34e4ab7cfe7c9c3b3b39a,

title = "Automatic analysis of dialect literature: advantages and challenges",

author = "Watson, Kevin and Jensen, {Marie M{\o}ller}",

year = "2020",

month = sep,

language = "English",

isbn = "9781474442565",

editor = "Patrick Honeybone and Warren Maguire",

booktitle = "Dialect Writing and the North of England",

publisher = "Edinburgh University Press",

address = "United Kingdom",

}

TY - CHAP

T1 - Automatic analysis of dialect literature

T2 - advantages and challenges

AU - Watson, Kevin

AU - Jensen, Marie Møller

PY - 2020/9

Y1 - 2020/9

UR - https://edinburghuniversitypress.com/book-dialect-writing-and-the-north-of-england.html

M3 - Book chapter

SN - 9781474442565

BT - Dialect Writing and the North of England

A2 - Honeybone, Patrick

A2 - Maguire, Warren

PB - Edinburgh University Press

ER -
