Dialect or not? How to identify the use of dialect in written online communication
2022-04-14, 09:30–10:00 (Europe/Vienna), Room 3


One commonly observed characteristic of written computer-mediated communication (CMC) is the use of non-standard language. Writers have different motives for making use of the medium’s more liberal writing conventions. Among these, the expression of one’s own linguistic identity and/or of one’s affiliation with a particular social group is among the most frequently cited when it comes to the use of vernacular languages or the switch between different languages and varieties (e.g. Tagg & Seargeant 2014, Schreiber 2015). However, even though it is easy to identify non-standard spellings, and it might also be easy for a member (or observer) of a social group to recognize and understand their local dialect spellings, clearly assigning those non-standard spelling variants to a specific local variety remains a methodological challenge from a scientific point of view. Linguists often presume that spelling reproduces spoken features and interpret non-standard variants as a reproduction of spoken dialect if phonetic features are reflected in the spelling (e.g. Ziegler 2005, Tophinke 2008), but this approach rests on interpretation rather than on evidence drawn from the data.
We present a study that addresses this problem by analysing German Facebook texts from the DiDi corpus of South Tyrolean CMC data (Frey, Glaznieks & Stemle 2016) using data-driven methods. The DiDi corpus provides access to more than 23,000 mainly German status updates, comments and chat messages written by around 120 writers in 2013 (corpus size: ca. 374,000 tokens). In addition, the corpus provides person-related metadata, such as gender, age and geographic origin, which are relevant variables for language variation (Löffler 2003). By correlating frequently occurring spelling variants of the Standard German -er suffix in the DiDi corpus with geographic, social and situational variables, Glück and Glaznieks (2019) were able to relate one variant (-o) to a specific geographic area (Val Pusteria). Its distribution is typical for dialect use, as confirmed by the variables gender (i.e. used more often by males, cf. also Sieburg 1992), age (i.e. used less by people between 30 and 60, cf. Vergeiner et al. in press) and communication type (i.e. used more often in chat messages). In this presentation we will extend this approach with new data by relying on co-occurring features at the grapheme, word and text level. Using methods from natural language processing and social network analysis, we will investigate variants of the most common words that show the -o suffix in a group of writers from Val Pusteria in order to establish other dialect features in the corpus that have not yet been scientifically identified as such. The network analysis methods will furthermore allow us to consider the number of variants a writer shows in his/her texts and will enable us to determine how consistent writers are and whether they clearly distinguish between two (or more) varieties (e.g. standard and regional dialect).
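The idea of relating a spelling variant to writer metadata can be illustrated with a minimal sketch. It is not the procedure used in the study: the messages, the word list and the function name `variant_share_by_group` are invented for illustration, and the -o spellings are simply derived by replacing a final -er with -o. The sketch computes, per region, the share of -o spellings among all tokens of the target words.

```python
from collections import Counter, defaultdict

# Hypothetical toy messages as (writer_region, text); NOT taken from the DiDi corpus.
MESSAGES = [
    ("Val Pusteria", "immo wiedo des gleiche"),
    ("Val Pusteria", "abo immer no besser"),
    ("Bolzano", "aber immer wieder"),
    ("Bolzano", "wieder do"),
]

# Assumed target words with the Standard German -er ending.
STANDARD_WORDS = ["aber", "immer", "wieder"]
# Assumed -o spellings, built by swapping the final "er" for "o" (e.g. "aber" -> "abo").
VARIANT_OF = {word[:-2] + "o": word for word in STANDARD_WORDS}


def variant_share_by_group(messages):
    """Return, per group, the share of -o spellings among all target-word tokens."""
    counts = defaultdict(Counter)
    for group, text in messages:
        for token in text.lower().split():
            if token in VARIANT_OF:
                counts[group]["o"] += 1
            elif token in STANDARD_WORDS:
                counts[group]["er"] += 1
    # Only report groups in which at least one target word occurred.
    return {
        group: c["o"] / (c["o"] + c["er"])
        for group, c in counts.items()
        if c["o"] + c["er"] > 0
    }


shares = variant_share_by_group(MESSAGES)
for region, share in shares.items():
    print(f"{region}: {share:.2f}")
```

The same grouping could be keyed on gender, age band or communication type instead of region. A real analysis would additionally need proper tokenisation, a vetted variant list and a statistical test rather than raw shares.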


Frey, Jennifer-Carmen/Glaznieks, Aivars/Stemle, Egon W. (2016): „The DiDi Corpus of South Tyrolean CMC Data: A multilingual corpus of Facebook texts“. In: Corazza, Anna/Montemagni, Simonetta/Semeraro, Giovanni (eds.): Proceedings of the Third Italian Conference on Computational Linguistics CLiC-it 2016. Torino, Accademia University Press: 157–161.
Löffler, Heinrich (2003): Dialektologie. Eine Einführung. Tübingen: Narr.
Schreiber, B. R. (2015). “I am what i am”: Multilingual identity and digital translanguaging. Language Learning and Technology, 19(3), 69–87.
Sieburg, Heinz (1992): Geschlechtstypischer Dialektgebrauch. Empirische Untersuchungen verschieden-geschlechtlicher Geschwister in der Ortschaft Fritzdorf (Rhein-Sieg-Kreis). Köln/Weimar/Wien: Böhlau. (= Rheinisches Archiv 129).
Tagg, Caroline, & Seargeant, Philip (2014): Audience design and language choice in the construction and maintenance of translocal communities on social network sites. In Seargeant, Philip/Tagg, Caroline (eds.): The Language of Social Media. Palgrave Macmillan: 161–185.
Tophinke, Doris (2008): „Regional schreiben: Weblogs zwischen Orthographie und Phonographie“. In: Christen, Helen/Ziegler, Evelyn (eds.): Sprechen, Schreiben, Hören – Zur Produktion und Perzeption von Dialekt und Standardsprache zu Beginn des 21. Jahrhunderts. Beiträge zum 2. Kongress der Internationalen Gesellschaft für Dialektologie des Deutschen. Wien, Praesens: 153–180.
Vergeiner, Philip/Wallner, Dominik/Bülow, Lars/Scheutz, Hannes (in preparation): Redialektisierung und Alter. Ergebnisse einer real-time-Studie zum Age-grading in Ulrichsberg. In: Christen, Helen/Ganswindt, Brigitte/Herrgen, Joachim/Schmidt, Jürgen Erich (eds.): Regiolekt – Der neue Dialekt? Akten des 6. Kongresses der Internationalen Gesellschaft für Dialektologie des Deutschen (IGDD).
Ziegler, Evelyn (2005): „Die Bedeutung von Interaktionsstatus und Interaktionsmodus für die Dialekt-Standard-Variation in der Chatkommunikation“. In: Eggers, Eckhard/Schmidt, Jürgen Erich/Stellmacher, Dieter (eds.): Moderne Dialekte – Neue Dialektologie. Akten des 1. Kongresses der Internationalen Gesellschaft für Dialektologie des Deutschen (IGDD). Stuttgart, Steiner: 719–745. (= Zeitschrift für Dialektologie und Linguistik. Beihefte 130).

Aivars is a senior researcher at the Institute for Applied Linguistics of Eurac Research Bozen/Bolzano. His research focuses on sociolinguistic aspects of the use of registers in CMC. He has been responsible for the creation of the multilingual CMC corpus DiDi, a corpus-linguistic collection of Facebook texts by writers from the multilingual Italian province of South Tyrol. Currently, he is working on the description of written German dialect varieties in CMC. As he always needs more authentic data for his work, he is happy to collect new data whenever possible in order to produce and analyse new corpora!

He has worked on the creation of several learner corpora in recent years, such as LEONIDE and KoKo, and combines writing research and corpus linguistics in his research activities.

Aivars is also the managing editor for the new Eurac Research Learner Corpus Platform PORTA.