Embracing Variability in Natural Language Processing
References
Aepli, Noëmi & Rico Sennrich. 2022. Improving Zero-Shot Cross-lingual Transfer Between Closely Related Languages by Injecting Character-Level Noise. In Findings of the Association for Computational Linguistics: ACL 2022, 4074–4083. Association for Computational Linguistics.
Bird, Steven. 2020. Decolonising Speech and Language Technology. In Proceedings of the 28th International Conference on Computational Linguistics, 3504–3519. International Committee on Computational Linguistics.
Blaschke, Verena, Hinrich Schütze & Barbara Plank. 2023a. A Survey of Corpora for Germanic Low-Resource Languages and Dialects. In Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa), 392–414. University of Tartu Library.
Blaschke, Verena, Hinrich Schütze & Barbara Plank. 2023b. Does Manipulating Tokenization Aid Cross-Lingual Transfer? A Study on POS Tagging for Non-Standardized Languages. In Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023), 40–54. Association for Computational Linguistics.
Kreutzer, Julia, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, et al. 2022. Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets. Transactions of the Association for Computational Linguistics 10, 50–72.
Lent, Heather, Kelechi Ogueji, Miryam de Lhoneux, Orevaoghene Ahia & Anders Søgaard. 2022. What a Creole Wants, What a Creole Needs. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, 6439–6449. European Language Resources Association.
Nguyen, Dong & Jack Grieve. 2020. Do Word Embeddings Capture Spelling Variation? In Proceedings of the 28th International Conference on Computational Linguistics, 870–881. International Committee on Computational Linguistics.
Samardžić, Tanja, Yves Scherrer & Elvira Glaser. 2015. Normalising orthographic and dialectal variants for the automatic processing of Swiss German. In Proceedings of The 4th Biennial Workshop on Less-Resourced Language, 294–298. European Language Resources Association.
Sun, Jiao, Thibault Sellam, Elizabeth Clark, Tu Vu, Timothy Dozat, Dan Garrette, Aditya Siddhant, Jacob Eisenstein & Sebastian Gehrmann. 2023. Dialect-robust Evaluation of Generated Text. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 6010–6028. Association for Computational Linguistics.
Zampieri, Marcos, Preslav Nakov & Yves Scherrer. 2020. Natural language processing for similar languages, varieties, and dialects: A survey. Natural Language Engineering, 26(6), 595–612. Cambridge University Press.
Ziems, Caleb, William Held, Jingfeng Yang, Jwala Dhamala, Rahul Gupta & Diyi Yang. 2023. Multi-VALUE: A Framework for Cross-Dialectal English NLP. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 744–768. Association for Computational Linguistics.
Over the last decades, Natural Language Processing (NLP) has largely focused on standard languages with many speakers and abundant data. More recently, interest has grown in moving away from treating languages as monoliths and towards recognizing and modelling the linguistic variability inherent in how many of us, including speakers of minority and non-standard languages, actually use language. From an NLP perspective, languages with few speakers and/or no generally accepted standard are interesting in that they require learning from data that are both sparse and heterogeneous. This is in stark contrast to large language models (LLMs), which are trained on massive amounts of data, much of which follows some de facto standard.
In this talk, we reflect on this asymmetry and ask how small languages can be processed with large language models. We summarize the current challenges that NLP research faces in creating models for highly variable or low-resource language varieties and present approaches for tackling these challenges along three major dimensions: resources (and resource awareness), modelling non-standard data, and human-centric design.
Firstly, quantitative studies on non-standard languages have mostly been carried out by researchers belonging to two largely disjoint communities: those studying linguistic variation and language varieties spoken by small speaker communities, and those focusing on language processing. While the lack of resources is severe, researchers might also not be aware of resources created outside of their own community. A recent survey of corpora for Germanic low-resource languages and dialects illustrates the dearth of resources: datasets are often small, largely unannotated and not always easy to find (Blaschke et al., 2023a).
Secondly, current methods in NLP are not equipped to gracefully handle input variation. This is a problem whenever a model encounters data with linguistic features it has not seen before, and non-standard orthographies pose an especially large problem. The currently popular methods for encoding textual input split the text into subword units, i.e. character sequences chosen based on frequency statistics from large corpora. Texts written to reflect phonetic variation (Samardžić et al., 2015) or in otherwise non-standard orthographies (Nguyen & Grieve, 2020) abound with character-level variation rarely seen in the training data and thus get encoded suboptimally.
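The encoding problem can be made concrete with a small sketch. It assumes a byte-pair-encoding-style tokenizer as one representative of the frequency-based methods mentioned above; the abstract itself does not name a specific algorithm. The toy word counts and the example spellings ("nuthin", "nuffin") are hypothetical illustrations, not data from any of the cited studies.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_merges(word_freqs, num_merges):
    """Learn BPE-style merges from word counts: repeatedly join the most frequent pair."""
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pair_counts[(left, right)] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Ties are broken by the pair itself so that the result is deterministic.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        vocab = {tuple(merge_pair(list(s), best)): f for s, f in vocab.items()}
    return merges


def encode(word, merges):
    """Split a word into characters, then apply the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


# Hypothetical toy "corpus" that only contains standard spellings.
standard_counts = {"thing": 12, "nothing": 10, "something": 8, "think": 6}
merges = learn_merges(standard_counts, num_merges=10)

for word in ["nothing", "nuthin", "nuffin"]:
    pieces = encode(word, merges)
    print(f"{word:8s} -> {len(pieces)} piece(s): {pieces}")
```

In this toy setting, the spelling seen during training is kept as a single unit, while the unseen variants fall back to more and shorter pieces. Real tokenizers are trained on far larger corpora, but the same mechanism is what makes non-standard orthographies costly to encode.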
Thirdly, not all NLP tools are built for the right reasons (Bird, 2020). Especially for minority and non-standard languages, it is important to involve members of the speaker communities, not only to avoid issues in data quality (Kreutzer et al., 2022) but, crucially, to ensure that the tools are of actual use to the speakers (Lent et al., 2022).
After outlining the major challenges, we will provide a snapshot of current solutions: from work on surveying existing resources (Zampieri et al., 2020; Blaschke et al., 2023a) to work on evaluating and improving the robustness of state-of-the-art NLP methods (Ziems et al., 2023; Aepli & Sennrich, 2022; Blaschke et al., 2023b; Sun et al., 2023). Overall, it is important to highlight interdisciplinary approaches and build on linguistic insights. Finally, involving members of the speaker communities both in the design and evaluation of systems is critical.