Large language models and small language varieties: Challenges and current methods

Panel Affiliation

Embracing Variability in Natural Language Processing


Over the last decades, Natural Language Processing (NLP) has largely focused on standard languages with many speakers and abundant data. More recently, there has been increasing interest in moving away from treating languages as monoliths and towards recognizing and modeling the linguistic variability inherent in how many of us, including speakers of minority and non-standard languages, actually use language. From an NLP perspective, languages with few speakers and/or no generally accepted standard are interesting in that they require learning from data that are both sparse and heterogeneous. This is in stark contrast to large language models (LLMs), which are trained on massive amounts of data, much of which follows some de-facto standard.

In this talk, we reflect on the asymmetry involved in processing small languages with large language models. We summarize the challenges that NLP research currently faces in creating models for highly variable or low-resource language varieties, and we present approaches for tackling these challenges along three major dimensions: resources (and resource awareness), modeling of non-standard data, and human-centric design.

Firstly, quantitative studies on non-standard languages have been carried out by researchers belonging largely to two disjoint communities: those studying linguistic variation and language varieties spoken by small speaker communities, and those focusing on language processing. The lack of resources is severe in itself, and researchers may additionally be unaware of resources created outside their own community. A recent documentation effort on Germanic datasets illustrates the dearth of resources: datasets are often small, largely unannotated, and not always easy to find (Blaschke et al., 2023a).

Secondly, current methods in NLP are not equipped to handle input variation gracefully. This is a problem whenever a model encounters data with linguistic features that are new to it, and non-standard orthographies pose an especially large problem. The currently popular methods for encoding textual input rely on splitting the text into character sequences (subword units) based on frequency statistics from large corpora. Texts written to reflect phonetic variation (Samardžić et al., 2015) or in otherwise non-standard orthographies (Nguyen et al., 2020) abound with character-level variation rarely seen in the training data, and they are therefore encoded suboptimally, typically as long sequences of short fragments.
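
The effect can be illustrated with a minimal sketch of one frequency-based splitting scheme, byte-pair encoding (BPE). We assume BPE here purely for illustration; the toy corpus, the function names (`learn_bpe`, `apply_merge`, `encode`) and the example spellings are hypothetical and not taken from the cited work. Merge rules are learned from "standard" spellings only, so a non-standard spelling of the same word falls back to single characters.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` by the merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Each word starts out as a tuple of single characters.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair; ties are broken deterministically by the pair itself.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[apply_merge(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Split `word` into subword units by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)


# Hypothetical training data: only the standard spelling is ever observed.
merges = learn_bpe(["nothing"] * 10 + ["thing"] * 5, num_merges=10)

print(encode("nothing", merges))  # ['nothing']
print(encode("nuffin", merges))   # ['n', 'u', 'f', 'f', 'i', 'n']
```

In this toy setting the standard spelling is encoded as one unit, while the non-standard spelling is encoded as six single characters, so the model receives a much longer and less informative input for the same word. Real tokenizers are trained on far larger corpora, but the underlying mechanism is comparable.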

Thirdly, not all NLP tools are built for the right reasons (Bird, 2020). Especially for minority and non-standard languages, it is important to involve members of the speaker communities, not only to avoid issues in data quality (Kreutzer et al., 2022) but, crucially, to ensure that the tools are of actual use to the speakers (Lent et al., 2022).

After outlining the major challenges, we will provide a snapshot of current solutions: from work on surveying existing resources (Zampieri et al., 2020; Blaschke et al., 2023a) to work on evaluating and improving the robustness of state-of-the-art NLP methods (Ziems et al., 2023; Aepli & Sennrich, 2022; Blaschke et al., 2023b; Sun et al., 2023). Overall, it is important to highlight interdisciplinary approaches and build on linguistic insights. Finally, involving members of the speaker communities both in the design and evaluation of systems is critical.