With a focus on the practices and perceptions of code-switching (CS) in written Luxembourgish (a West Germanic language with more than 400,000 speakers), this research project contributes to the understanding of multilingualism in a diverse, multicultural society.
Luxembourg, with its three official languages, Luxembourgish, German, and French, boasts a profoundly multilingual environment in which English also assumes a crucial role as a lingua franca. This diversity is mirrored in computer-mediated communication, e.g., in complex forms of switching between different languages. The objective of this study is to investigate practices and perceptions of CS in written Luxembourgish, with a particular focus on digital forms of interaction. The data for the study stem from the RTL news platform, the main public and commercial broadcasting channel in the country and a steady driving force for written Luxembourgish.
To trace and model CS in a complex multilingual setting, I am developing a mixed-methods approach that combines sociolinguistic methods (perception study, interviews) with Natural Language Processing (NLP) techniques. In doing so, I am linking the computational analysis of CS patterns in written communication with qualitative interviews that investigate the reasons and circumstances behind its usage. Additionally, I am using user-generated annotations to study individual perceptions of CS in written Luxembourgish. These annotations can also serve as training data for a context-sensitive language identification tool that automatically detects CS in Luxembourgish. In this way, the study also contributes to a better representation of small and lesser-resourced languages in the research landscape. The challenges in this endeavor stem primarily from the substantial variation in written Luxembourgish, which includes orthographic variation (due to the absence of a widely established spelling norm) alongside other sources of variation (dialects, language contact, online language). Hence, the combination of qualitative and quantitative methods not only helps clarify the role of CS in Luxembourgish but also helps mitigate some of the challenges in computational language processing, since perception testing produces fine-grained annotations of CS.
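To make the idea of context-sensitive, token-level language identification more concrete, the following is a minimal sketch and not the tool developed in this project. It assumes a tiny hand-written lexicon (`TOY_LEXICON`, hypothetical) in place of a model trained on the annotations, and it resolves unknown tokens by borrowing the label of a neighbouring known token; the language codes, function names, and the example sentence are illustrative assumptions.

```python
from typing import Dict, List, Optional

# Hypothetical toy lexicon (illustration only). A real system would learn
# token-language associations from annotated data instead of a word list.
TOY_LEXICON: Dict[str, str] = {
    "ech": "lb", "hunn": "lb", "net": "lb", "gesinn": "lb",
    "update": "en", "nice": "en",
    "pourquoi": "fr", "toujours": "fr",
}

UNKNOWN = "und"  # label used when no token in the input is known


def tag_tokens(tokens: List[str]) -> List[str]:
    """Assign a language label to every token.

    Known tokens get their lexicon label. Unknown tokens inherit the label
    of the closest known token to their left; unknown tokens at the start
    inherit the first known label to their right.
    """
    looked_up: List[Optional[str]] = [TOY_LEXICON.get(t.lower()) for t in tokens]

    # Forward pass: carry the most recent known label to the right.
    labels: List[str] = []
    previous: Optional[str] = None
    for label in looked_up:
        if label is not None:
            previous = label
        labels.append(previous if previous is not None else UNKNOWN)

    # Leading unknown tokens: borrow the first known label, if there is one.
    first_known = next((lab for lab in looked_up if lab is not None), None)
    if first_known is not None:
        for i, label in enumerate(looked_up):
            if label is not None:
                break
            labels[i] = first_known
    return labels


def switch_points(labels: List[str]) -> List[int]:
    """Return the indices at which the label differs from the previous one."""
    return [i for i in range(1, len(labels)) if labels[i] != labels[i - 1]]


if __name__ == "__main__":
    tokens = "Ech hunn den Update gesinn".split()  # invented example sentence
    labels = tag_tokens(tokens)
    print(list(zip(tokens, labels)))
    print("switch points:", switch_points(labels))
```

The neighbour-based fallback is the simplest possible stand-in for context sensitivity. It also shows why orthographic variation is a problem: any spelling variant missing from the lexicon is silently absorbed into the surrounding language.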
In this poster presentation, my primary focus will be on the methodological facets of my research, initial findings, and the challenges encountered while developing computational tools for the study of Luxembourgish. Specifically, I will discuss the development and testing of a language identification tool for CS in written text. Using a variety of existing NLP tools and pipelines, such as Langdetect, LangID, fastText, spaCy, and Polyglot, it can be shown that the automatic identification of CS in Luxembourgish text depends on several text-related factors, such as text length, the languages involved in the switch, and orthographic correctness, while the existing NLP tools yield an overall stable performance for shorter snippets of text. A mixed-methods approach is beneficial for this task because of the contextual grounding and individual encoding of CS in interaction, especially in a multilingual setting, where the linguistic classification of CS is often complicated. In this way, the study not only provides technical solutions for the study of CS in written language (automatic CS detection) but also contributes to the theoretical discussion around CS by providing multifaceted insights into the pragmatic organization and individual perception of CS in a highly multilingual setting.
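One way to examine how identification quality varies with text length is to group labelled snippets into length buckets and compute accuracy per bucket. The sketch below is a hedged illustration of that evaluation idea only: the bucket boundaries, the sample snippets, and `stub_identifier` are assumptions made for the example, not the project's data or results. A wrapper around any of the tools named above could be passed in place of the stub.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Tuple


def length_bucket(text: str) -> str:
    """Bucket a snippet by its whitespace token count (boundaries are arbitrary)."""
    n_tokens = len(text.split())
    if n_tokens <= 3:
        return "short"
    if n_tokens <= 10:
        return "medium"
    return "long"


def accuracy_by_length(
    samples: Iterable[Tuple[str, str]],
    identify: Callable[[str], str],
) -> Dict[str, float]:
    """Compute the share of correctly identified snippets per length bucket.

    `samples` holds (text, gold_language) pairs; `identify` maps a text to a
    predicted language code.
    """
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for text, gold in samples:
        bucket = length_bucket(text)
        totals[bucket] += 1
        hits[bucket] += int(identify(text) == gold)
    return {bucket: hits[bucket] / totals[bucket] for bucket in totals}


def stub_identifier(text: str) -> str:
    """Hypothetical placeholder: predicts 'lb' if the token 'ech' occurs, else 'en'."""
    tokens = [t.lower() for t in text.split()]
    return "lb" if "ech" in tokens else "en"


if __name__ == "__main__":
    # Invented mini-sample, for illustration only.
    samples = [
        ("Ech hunn", "lb"),
        ("nice update", "en"),
        ("pourquoi", "fr"),
        ("Ech hunn den Update gesinn", "lb"),
    ]
    for bucket, acc in accuracy_by_length(samples, stub_identifier).items():
        print(f"{bucket}: {acc:.2f}")
```

The same grouping can be applied to other factors mentioned above, such as the language pair involved or whether a snippet follows the spelling norm, by swapping `length_bucket` for a different grouping function.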