Variation is the norm: Orthographic variability and metalinguistic stance in Luxembourgish user comments


Panel Affiliation

Embracing variability in NLP

References

Lothritz, Cédric et al. (2023): Comparing Pre-Training Schemes for Luxembourgish BERT Models. Konferenz zur Verarbeitung natürlicher Sprache (KONVENS) 2023.

Purschke, Christoph (2020): Attitudes towards multilingualism in Luxembourg. A comparative analysis of online news comments and crowdsourced questionnaire data. Frontiers in Artificial Intelligence 3:536086. doi: 10.3389/frai.2020.536086.

Purschke, Christoph (2023): Diskurs-Figuren. Wie Politik und Öffentlichkeit über Sprache sprechen. Hémecht. Zeitschrift für Luxemburger Geschichte 3/2023, 310–327.

Abstract

The structural and societal development of Luxembourgish over the last decades has led to a situation full of contrasts. The language has seen a recent upswing in public recognition, visible in increasing speaker numbers and political promotion campaigns. At the same time, Luxembourgish has come under pressure in the multilingual language regime, especially vis-à-vis French and English. Supported by the development of digital and social media, the written domain of Luxembourgish has now fully developed; its standardization, however, is not yet complete, and a wide range of variation exists in written texts, partly because Luxembourgish is hardly part of school curricula. The current dynamics of multilingualism in the country are a recurring theme in public debates, with repercussions on the programs of political parties; yet the discourse on multilingualism and its expansion has been treading water for decades (Purschke 2023).

The TRAVOLTA project is set against this complex backdrop. It is the first project to systematically study the development of written Luxembourgish over the last 20 years while also considering public debates about the social significance of the language. The project is based on the most extensive available corpus of written texts, the complete online archive of RTL.lu. Its first focus is the analysis of individual and group-based writing practices and spaces of variation over time (Purschke 2020). In particular, comparing private texts (user comments) and professional products (journalistic articles) provides evidence for a domain-specific development of written Luxembourgish. The second focus of the project is the analysis of discourses on multilingualism and the role of Luxembourgish on the RTL platform. Taking user comments as an example, this part concentrates on various procedures of individual stance-taking and on how they are reflected in collective discourses about language. The project combines variationist and sociolinguistic methods with machine learning and NLP methods.

In this talk, we present the project's first results, organized around two doctoral theses. In the first phase of the project, the main goal is to develop a set of analytical tools tailored to the particular situation of Luxembourgish. These tools will allow us to handle the enormous amount of variation in written Luxembourgish and to take it into account when developing language technology. Unlike larger languages with completed standardization and many NLP resources, Luxembourgish so far has hardly any language models or larger annotated data sets, in part because factors such as variation and code-switching make model training challenging. At the same time, Luxembourgish is an ideal test case for models and theories from NLP: a small language with a lot of input variation can be used to evaluate the state of the art in automatic language processing systematically and to extend it toward a linguistically and demographically more diverse technology (Lothritz et al. 2023). In addition to data analysis, TRAVOLTA therefore also aims to develop new language technology resources for Luxembourgish as a small-language testbed.