Automatic analysis of dialect literature: advantages and challenges

Kevin Watson, Marie Møller Jensen

Research output: Contribution to book/anthology/report/conference proceeding > Book chapter > Research > peer-review


Although the type of orthographic variation found in dialect literature has often been dismissed as somewhat unwieldy and haphazard, recent work has argued that a systematic analysis of such spelling can reveal details about the sociolinguistic and phonological salience of the features being represented (Honeybone & Watson 2013). To do this, respellings are treated as sociolinguistic variables and quantified. As with the analysis of sociolinguistic variables in speech, this requires the identification of the relevant variable contexts, and (for categorical variables) the labelling of particular variants. In speech, great advances have been made in the use of automatic methods for identifying relevant contexts (e.g. LaBB-CAT, Fromont & Hay 2012), but in the analysis of spelling variants in dialect literature, manual identification and labelling are still the norm. This increases the time needed to carry out the analysis, and limits the size of the datasets that have been considered in the field so far. In this chapter we assess the utility of two corpus linguistic tools for the automatic analysis of dialect literature – particularly contemporary, humorous, localised dialect literature (CHLDL). The first is VARD, a tool which was designed to standardise spellings in historical English texts (Baron & Rayson 2008). Using techniques from modern spell checkers alongside training algorithms, VARD adds a standard spelling layer to a non-standardly spelled text. The result is that each non-standardly spelled word has an associated equivalent in standard spelling. This set of correspondences is the input to the second tool, DICER (Discovery and Investigation of Character Edit Rules; Baron et al. 2009). DICER identifies and quantifies the differences between the standard and non-standard spellings, providing information about, for example, which characters are changed and how frequently the changes occur.
Taken together, these tools offer an opportunity to upscale the data used for the analysis of dialect literature. To evaluate these tools, we use two CHLDL texts: one from Liverpool English, and one from Newcastle English. For each of these texts, we assess the accuracy of the automatic coding in VARD, and test the amount of training data needed for accurate results. We then assess the quantification in DICER, and compare it to what is already known about the respellings of these texts (e.g. Honeybone & Watson 2013, Beal 2000, Jensen 2013). Finally, we reflect on the advantages and likely future challenges of this kind of analysis in the investigation of dialect literature.
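The kind of output described above for the second tool, counts of which characters change between a standard spelling and its respelling, can be illustrated with a minimal sketch. This is not DICER's algorithm or interface: it uses Python's standard-library difflib as a stand-in alignment method, the function names are hypothetical, and the word pairs are invented for illustration rather than taken from the texts analysed in the chapter.

```python
from collections import Counter
from difflib import SequenceMatcher


def character_edits(standard, variant):
    """Yield (operation, standard_chars, variant_chars) for each span that differs.

    Alignment is done with difflib.SequenceMatcher, which is an assumption
    made for this sketch and not the method used by DICER.
    """
    matcher = SequenceMatcher(None, standard, variant, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            yield tag, standard[i1:i2], variant[j1:j2]


def count_edit_rules(pairs):
    """Count how often each character edit occurs across (standard, variant) pairs."""
    counts = Counter()
    for standard, variant in pairs:
        counts.update(character_edits(standard, variant))
    return counts


# Invented (standard, respelled) pairs, for illustration only.
pairs = [("thing", "ting"), ("think", "tink"), ("nothing", "nothin")]
rules = count_edit_rules(pairs)

for (operation, standard_chars, variant_chars), n in rules.most_common():
    print(operation, repr(standard_chars), "->", repr(variant_chars), n)
```

Treating each edit as a countable unit is what allows respellings to be handled like sociolinguistic variables: the counter keys play the role of variants, and their frequencies can be compared across texts.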
Original language: English
Title of host publication: Dialect Writing and the North of England
Editors: Patrick Honeybone, Warren Maguire
Publisher: Edinburgh University Press
Publication date: Sept 2020
ISBN (Print): 9781474442565
Publication status: Published - Sept 2020

