Scobbie, James M. 2006. (R) as a variable. In Keith Brown (ed.), Encyclopedia of language and linguistics, 2nd edn, vol. 10, 337-344. Elsevier.
Sebregts, Koen. 2015. The sociophonetics and phonology of Dutch r. LOT.
Sebregts, Koen, Hans Van de Velde & Roeland van Hout. 2023. Sociophonetics and rhotics. In C. Strelluf (ed.), The Routledge Handbook of Sociophonetics, 195-213. Routledge.
Spreafico, Lorenzo & Alessandro Vietti (eds.). 2013. Rhotics: New data and perspectives. Bolzano University Press.
Tops, Evie. 2009. Variatie en Verandering van de /r/ in Vlaanderen. VUB Press.
Van de Velde, Hans, Evie Tops & Roeland van Hout. 2013. The spreading of uvular [R] in Flanders. In Spreafico & Vietti (2013), 225-248.
Van de Velde, Hans & Roeland van Hout (eds.). 2001. 'r-atics: Sociolinguistic, phonetic and phonological characteristics of /r/. Etudes & Travaux - IVLP/ULB.
Rhotics display a large amount of phonetic variation and consequently lend themselves to becoming sociolinguistic variables in many language varieties (Van de Velde & Van Hout 2001; Scobbie 2006; Spreafico & Vietti 2013). Sebregts et al. (2023) demonstrate the potential of r for informing theories of variation and change and for testing sociophonetic methodologies.
One of the methodological challenges is the systematic distinction of the wide range of r variants. Even for trained phoneticians this is notoriously difficult and time-consuming. At the same time, there is a need to study r in large speech corpora. In this poster we present the RrecogniserR, a tool for the automatic classification of r variants. It was developed using machine learning techniques, two large data sets of Dutch r-variants (about 45,000 tokens), and the combined human and computational resources of an artificial intelligence company. This tool will increase the accuracy, speed and quantity of the study of r-variation in Dutch and other languages.
The data sets used for training and testing are:
• 1904 speakers of Belgian Dutch, stratified for age (2), gender (2) and locality (89), 12 items per speaker, 12 different r-variants (Tops 2009).
• 408 speakers of Dutch, from 6 urban centres in the Netherlands and 4 in Belgium, stratified for age (2) and gender (2), 55 items per speaker, 20 different r-variants (Sebregts 2015).
Both data sets contain word list data.
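As a rough check on corpus size, the speaker and item counts listed above can be multiplied out. The sketch below assumes that every speaker produced every item and that no token was excluded, so it gives an upper bound rather than the exact number of usable tokens; the dictionary keys are only labels chosen for this illustration.

```python
# Upper-bound token counts, assuming every speaker produced every item.
# (speakers, items per speaker) are taken from the data set descriptions.
datasets = {
    "Tops 2009": (1904, 12),
    "Sebregts 2015": (408, 55),
}

tokens = {name: speakers * items for name, (speakers, items) in datasets.items()}
total = sum(tokens.values())

for name, count in tokens.items():
    print(f"{name}: {count} tokens")
print(f"Total: {total} tokens")
```

Under these assumptions the two data sets contribute 22,848 and 22,440 tokens, for 45,288 in total, which is in line with the rounded figure of 45,000 tokens.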
In a pilot study on the Tops (2009) data set, we used a pretrained multilingual wav2vec2 model (Facebook) on 200 ms fragments centered on /r/, as located by forced alignment. On 1972 test items (8 words from 18 localities, 11 variants), this yielded almost 70% correct classification of the 11 variants and 98% correct classification of place of articulation (front vs. back).
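Two steps of the pilot set-up can be illustrated with a minimal sketch: cutting a 200 ms window around the aligned /r/ interval, and scoring predictions both per variant and per place of articulation. This is not the RrecogniserR implementation. The function names, the clamping behaviour at file edges, the 16 kHz sample rate, the four variant labels and the toy predictions are all assumptions made for illustration.

```python
def centered_window(seg_start_s, seg_end_s, sample_rate, width_s=0.2, n_samples=None):
    """Return (first, last_exclusive) sample indices of a window of width_s
    seconds centered on the midpoint of the aligned /r/ interval.
    Clamping at the file edges is an assumption, not the authors' method."""
    mid = (seg_start_s + seg_end_s) / 2.0
    first = int(round((mid - width_s / 2.0) * sample_rate))
    last = first + int(round(width_s * sample_rate))
    if first < 0:
        # Shift the window right so that it starts at the first sample.
        first, last = 0, last - first
    if n_samples is not None and last > n_samples:
        # Shift the window left so that it ends at the last sample.
        shift = last - n_samples
        first, last = max(0, first - shift), n_samples
    return first, last


def accuracy(gold, pred):
    """Proportion of tokens whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Hypothetical variant labels and their place of articulation (front vs. back).
PLACE = {
    "alveolar_trill": "front",
    "alveolar_tap": "front",
    "uvular_trill": "back",
    "uvular_fricative": "back",
}

# Toy gold labels and predictions, made up for this example.
gold = ["alveolar_trill", "alveolar_tap", "uvular_trill", "uvular_fricative"]
pred = ["alveolar_tap", "alveolar_tap", "uvular_fricative", "uvular_fricative"]

variant_accuracy = accuracy(gold, pred)
place_accuracy = accuracy([PLACE[g] for g in gold], [PLACE[p] for p in pred])

window = centered_window(0.50, 0.56, sample_rate=16000)
print(window, variant_accuracy, place_accuracy)
```

In the toy data every error stays within the same place of articulation, so the variant-level score is lower than the place-level score. This is the same kind of gap as the one between the two pilot figures, although the numbers here carry no empirical weight.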
In the poster we will focus on the best-performing technique(s) and discuss issues such as data augmentation for low-frequency variants, the size and selection of training sets, different models, and the optimal extraction technique. We will also evaluate the performance of the RrecogniserR on other words (i.e., words not in the training set), other speakers, other localities/varieties, and other languages.
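Testing on unseen words, speakers or localities requires that no value of the chosen grouping factor occurs in both the training and the test set. The sketch below shows one simple way to obtain such a split. The token fields (speaker, word), the 20% test fraction and the function name are hypothetical and are not taken from the study.

```python
import random


def group_holdout(tokens, key, test_fraction=0.2, seed=0):
    """Split tokens so that no value of `key` (e.g. speaker, word, locality)
    occurs in both the training and the test set."""
    groups = sorted({t[key] for t in tokens})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [t for t in tokens if t[key] not in test_groups]
    test = [t for t in tokens if t[key] in test_groups]
    return train, test


# Toy token list: 10 hypothetical speakers, 3 hypothetical words each.
tokens = [
    {"speaker": f"spk{s:02d}", "word": w}
    for s in range(10)
    for w in ("word_a", "word_b", "word_c")
]

train, test = group_holdout(tokens, key="speaker")
train_speakers = {t["speaker"] for t in train}
test_speakers = {t["speaker"] for t in test}
print(len(train), len(test), sorted(test_speakers))
```

Passing key="word" or a locality field instead of key="speaker" gives the corresponding hold-out condition with the same function.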