Colexifications for Bootstrapping Cross-lingual Datasets: The Case of Phonology, Concreteness, and Affectiveness

Yiyi Chen; Johannes Bjerva

doi:10.48550/arXiv.2306.02646

Research output: Contribution to book/anthology/report/conference proceeding › Article in proceeding › Research › peer-review

Abstract

Colexification refers to the linguistic phenomenon where a single lexical form is used to convey multiple meanings. By studying cross-lingual colexifications, researchers have gained valuable insights into fields such as psycholinguistics and cognitive sciences (Jackson et al., 2019; Xu et al., 2020; Karjus et al., 2021; Schapper and Koptjevskaja-Tamm, 2022; François, 2022). While several multilingual colexification datasets exist, there is untapped potential in using this information to bootstrap datasets across semantic features. In this paper, we aim to demonstrate how colexifications can be leveraged to create such cross-lingual datasets. We showcase curation procedures which result in a dataset covering 142 languages from 21 language families around the world. The dataset includes ratings of concreteness and affectiveness, mapped to phonemes and phonological features. We further analyze the dataset along different dimensions to demonstrate the potential of the proposed procedures in facilitating further interdisciplinary research in psychology, cognitive science, and multilingual natural language processing (NLP). Based on initial investigations, we observe that i) concepts that are closer in concreteness/affectiveness are more likely to colexify; ii) certain initial/last phonemes are significantly correlated with concreteness/affectiveness within language families, such as /k/ as the initial phoneme in both Turkic and Tai-Kadai correlated with concreteness, and /p/ in Dravidian and Sino-Tibetan correlated with valence; iii) the type-to-token ratio (TTR) of phonemes is positively correlated with concreteness across several language families, while the length of phoneme segments is negatively correlated with concreteness; iv) certain phonological features are negatively correlated with concreteness across languages. The dataset is publicly available online for further research.
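The two phoneme-level measures named in observation iii) can be illustrated with a minimal sketch. It assumes that TTR is the number of distinct phonemes divided by the total number of phoneme segments in a form; the function name, the space-separated transcription format, and the example forms are hypothetical and not taken from the dataset, and the paper's exact computation may differ.

```python
def phoneme_ttr(segments):
    """Type-to-token ratio: distinct phonemes / total phoneme segments."""
    if not segments:
        raise ValueError("cannot compute TTR of an empty segment list")
    return len(set(segments)) / len(segments)


# Hypothetical space-separated phoneme transcriptions (illustrative only).
forms = {
    "form_a": "k a t a",  # the phoneme /a/ repeats
    "form_b": "p i l u",  # every phoneme is distinct
}

for name, transcription in forms.items():
    segments = transcription.split()
    # Report the segment length and the TTR for each form.
    print(name, len(segments), phoneme_ttr(segments))
```

Under this definition a form with no repeated phonemes has a TTR of 1.0, and the value drops as phonemes recur.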

Original language	English
Title of host publication	ACL 2023 - 20th SIGMORPHON Workshop on Computational Morphology, Phonology, and Phonetics, CMPP 2023
Editors	Garrett Nicolai, Eleanor Chodroff, Cagri Coltekin, Fred Mailhot
Number of pages	12
Publisher	Association for Computational Linguistics
Publication date	Jul 2023
Pages	98-109
ISBN (Electronic)	9781959429937
DOIs	https://doi.org/10.48550/arXiv.2306.02646 https://doi.org/10.18653/v1/2023.sigmorphon-1.11
Publication status	Published - Jul 2023
Event	20th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology - Toronto, Canada. Duration: 14 Jul 2023 → 14 Jul 2023

Conference

Conference	20th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology
Country/Territory	Canada
City	Toronto
Period	14/07/2023 → 14/07/2023

Access to Document

10.48550/arXiv.2306.02646 (Licence: CC BY 4.0)
10.18653/v1/2023.sigmorphon-1.11 (Licence: CC BY-NC-SA 3.0)

Open Access article: Final published version, 728 KB (Licence: CC BY-NC-SA 3.0)

https://arxiv.org/pdf/2306.02646.pdf

Cite this

Chen, Y., & Bjerva, J. (2023). Colexifications for Bootstrapping Cross-lingual Datasets: The Case of Phonology, Concreteness, and Affectiveness. In G. Nicolai, E. Chodroff, C. Coltekin, & F. Mailhot (Eds.), ACL 2023 - 20th SIGMORPHON Workshop on Computational Morphology, Phonology, and Phonetics, CMPP 2023 (pp. 98-109). Association for Computational Linguistics. https://doi.org/10.48550/arXiv.2306.02646, https://doi.org/10.18653/v1/2023.sigmorphon-1.11

Chen, Yiyi ; Bjerva, Johannes. / Colexifications for Bootstrapping Cross-lingual Datasets : The Case of Phonology, Concreteness, and Affectiveness. ACL 2023 - 20th SIGMORPHON Workshop on Computational Morphology, Phonology, and Phonetics, CMPP 2023. editor / Garrett Nicolai ; Eleanor Chodroff ; Cagri Coltekin ; Fred Mailhot. Association for Computational Linguistics, 2023. pp. 98-109

@inproceedings{2600315b11cf4812a761419a3e7ba934,

title = "Colexifications for Bootstrapping Cross-lingual Datasets: The Case of Phonology, Concreteness, and Affectiveness",

author = "Yiyi Chen and Johannes Bjerva",

year = "2023",

month = jul,

doi = "10.48550/arXiv.2306.02646",

language = "English",

pages = "98--109",

editor = "Garrett Nicolai and Eleanor Chodroff and Cagri Coltekin and Fred Mailhot",

booktitle = "ACL 2023 - 20th SIGMORPHON Workshop on Computational Morphology, Phonology, and Phonetics, CMPP 2023",

publisher = "Association for Computational Linguistics",

address = "United States",

note = "20th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology ; Conference date: 14-07-2023 Through 14-07-2023",

}

Chen, Y & Bjerva, J 2023, Colexifications for Bootstrapping Cross-lingual Datasets: The Case of Phonology, Concreteness, and Affectiveness. in G Nicolai, E Chodroff, C Coltekin & F Mailhot (eds), ACL 2023 - 20th SIGMORPHON Workshop on Computational Morphology, Phonology, and Phonetics, CMPP 2023. Association for Computational Linguistics, pp. 98-109, 20th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, Toronto, Canada, 14/07/2023. https://doi.org/10.48550/arXiv.2306.02646, https://doi.org/10.18653/v1/2023.sigmorphon-1.11

Colexifications for Bootstrapping Cross-lingual Datasets: The Case of Phonology, Concreteness, and Affectiveness. / Chen, Yiyi ; Bjerva, Johannes.
ACL 2023 - 20th SIGMORPHON Workshop on Computational Morphology, Phonology, and Phonetics, CMPP 2023. ed. / Garrett Nicolai; Eleanor Chodroff; Cagri Coltekin; Fred Mailhot. Association for Computational Linguistics, 2023. p. 98-109.

TY - GEN

T1 - Colexifications for Bootstrapping Cross-lingual Datasets

T2 - 20th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

AU - Chen, Yiyi

AU - Bjerva, Johannes

PY - 2023/7

Y1 - 2023/7

UR - http://www.scopus.com/inward/record.url?scp=85175400210&partnerID=8YFLogxK

U2 - 10.48550/arXiv.2306.02646

DO - 10.48550/arXiv.2306.02646

M3 - Article in proceeding

SP - 98

EP - 109

BT - ACL 2023 - 20th SIGMORPHON Workshop on Computational Morphology, Phonology, and Phonetics, CMPP 2023

A2 - Nicolai, Garrett

A2 - Chodroff, Eleanor

A2 - Coltekin, Cagri

A2 - Mailhot, Fred

PB - Association for Computational Linguistics

Y2 - 14 July 2023 through 14 July 2023

ER -

Chen Y , Bjerva J. Colexifications for Bootstrapping Cross-lingual Datasets: The Case of Phonology, Concreteness, and Affectiveness. In Nicolai G, Chodroff E, Coltekin C, Mailhot F, editors, ACL 2023 - 20th SIGMORPHON Workshop on Computational Morphology, Phonology, and Phonetics, CMPP 2023. Association for Computational Linguistics. 2023. p. 98-109 doi: 10.48550/arXiv.2306.02646, 10.18653/v1/2023.sigmorphon-1.11

Projects

Multilingual Modelling for Resource-Poor Languages

Prizes

EliteForsk- Elite Research Travel Grant 2024
