Embracing Variability in Natural Language Processing
References – Lindow, W., Möhn, D., Niebaum, H., Stellmacher, D., Taubken, H., & Wirrer, J. (1998). Niederdeutsche Grammatik. Schuster.
Bloemhoff, H., & Van der Kooi, J. (2008). Handboek Nedersaksische taal- en letterkunde. Uitgeverij Van Gorcum.
Wieling, M., & Nerbonne, J. (2015). Advances in dialectometry. Annual Review of Linguistics, 1(1), 243–264.
Bartelds, M., & Wieling, M. (2022). Quantifying language variation acoustically with few resources. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 3735–3741. https://doi.org/10.18653/v1/2022.naacl-main.273
Buurke, R. S. S. J., Sekeres, H. G., Heeringa, W., Knooihuizen, R., & Wieling, M. (2022). Estimating the level and direction of aggregated sound change of dialects in the northern Netherlands. Taal & Tongval, 74(2), 183–214.
Abstract – Low Saxon (also called ‘Low German’) is a non-standardised West Germanic language spoken in parts of the Netherlands and Germany. The absence of an interregional standard variety makes it difficult to create large, relatively uniform corpora. This poses a challenge for NLP (natural language processing), since such corpora are crucial for modern machine learning approaches.
We will present Low Saxon dialectal variation synchronically and diachronically at different linguistic levels: phonetic, morphological, syntactic and lexical. While the phonetic level is represented by sound recordings, the other levels are based on literary text corpora. The sound recordings cover the last 50 years, whereas the text corpora cover the last 200 years. We will also discuss the preprocessing (e.g., word-boundary detection, PoS-tagging, and lemmatisation) used in the creation of these corpora.
For the phonetic level, the differences between pairs of sound recordings are usually quantified, either between the sound recordings directly or between transcriptions of the recordings (Bartelds & Wieling, 2022; Buurke et al., 2022). In (text) corpus-based dialectometry, on the other hand, dialect distances are quantified based on the aggregate similarity of selected features (Wieling & Nerbonne, 2015), which we calculate with clustering and dimensionality reduction techniques.
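As an illustration of the transcription-based approach, the following is a minimal sketch that assumes a length-normalised Levenshtein distance averaged over shared words. The abstract does not specify which distance measure is used, so this choice is an assumption. The variety names and the toy data are hypothetical, and the function names are introduced here for illustration only.

```python
from itertools import combinations


def levenshtein(a, b):
    """Edit distance between two sequences of phonetic segments."""
    prev = list(range(len(b) + 1))
    for i, seg_a in enumerate(a, start=1):
        curr = [i]
        for j, seg_b in enumerate(b, start=1):
            cost = 0 if seg_a == seg_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalised_distance(a, b):
    """Edit distance divided by the length of the longer transcription."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def aggregate_distances(transcriptions):
    """Average the per-word distances for every pair of varieties.

    `transcriptions` maps variety -> {word label -> list of segments}.
    Only words attested in both varieties of a pair are compared.
    """
    result = {}
    for v1, v2 in combinations(sorted(transcriptions), 2):
        shared = transcriptions[v1].keys() & transcriptions[v2].keys()
        if not shared:
            continue  # no comparable material for this pair
        total = sum(
            normalised_distance(transcriptions[v1][w], transcriptions[v2][w])
            for w in shared
        )
        result[(v1, v2)] = total / len(shared)
    return result


# Hypothetical toy data: one word in three invented varieties.
toy = {
    "variety_A": {"as": ["a", "s"]},
    "variety_B": {"as": ["ɑ", "s"]},
    "variety_C": {"as": ["ɑ", "l", "s"]},
}
distances = aggregate_distances(toy)
for pair, value in distances.items():
    print(pair, round(value, 3))
```

The resulting pairwise distances could then be passed to a clustering or dimensionality reduction routine. With realistic data, each variety would contribute many words, so that the average reflects aggregate rather than word-specific differences.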
We expect to observe linguistic influence from Dutch and German in the corpora. An example at the phonetic level is convergence towards Standard Dutch vowels, such as producing [as] instead of [ɑs]. The former form is the more traditional Low Saxon form, while the latter is similar to the Dutch equivalent [ɑls]. As the text corpora cover both sides of the border, we expect to find divergences resulting from the influences of the national languages. For example, the auxiliary verb ‘wērden’ is increasingly used for the future tense in German Low Saxon, while this does not happen in Dutch Low Saxon. However, influence from the national languages does not always lead to divergence. For instance, the IPP (Infinitivus pro participio) construction that is spreading in Low Saxon is found in both Standard Dutch and Standard German.
Some changes we observe cannot straightforwardly be explained by language contact. In Gronings, which is a group of Dutch Low Saxon varieties, there is an ongoing vowel change from [ɑɪ] to [ɔɪ] (Bloemhoff & Van der Kooi, 2008, p. 162), which does not seem to stem from Standard Dutch influence or influence from neighbouring varieties. Similarly, German Low Saxon is losing morphological complexity, in particular in case inflection (Lindow et al., 1998), whereas Standard German has preserved a four-case system.
Improving annotation procedures and modelling the variation and change support the development of NLP tools for Low Saxon. They also offer new insights into language contact dynamics within regional languages and between a regional language and a national language, a contact situation that is common in Europe today.