Exploring the parsed corpus of Southern Dutch Dialects (GCND)


References

Farasyn, Melissa, Anne-Sophie Ghyselen, Jacques Van Keymeulen & Anne Breitbarth. 2022. “Challenges in tagging and parsing spoken dialects of Dutch.” Journal of Historical Syntax 6 (4–11). doi: 10.18148/hs/2022.v6i4-11.92. [Special issue ‘Annotating Historical Corpora’].

Ghyselen, Anne-Sophie, Jacques Van Keymeulen, Melissa Farasyn, Lien Hellebaut & Anne Breitbarth. 2020. “Het transcriptieprotocol van het Gesproken Corpus van de Nederlandse Dialecten (GCND).” Bulletin de la Commission Royale de Toponymie & Dialectologie = Handelingen van de Koninklijke Commissie voor Toponymie & Dialectologie 92: 83–115.

Breitbarth, Anne, Melissa Farasyn, Anne-Sophie Ghyselen & Jacques Van Keymeulen. 2020. “Het Gesproken Corpus van de zuidelijk-Nederlandse Dialecten.” Handelingen van de Koninklijke Zuid-Nederlandse Maatschappij voor Taal- en Letterkunde en Geschiedenis (KZM) LXXII: 23–38.

Ghyselen, Anne-Sophie, Anne Breitbarth, Melissa Farasyn, Jacques Van Keymeulen & Arjan van Hessen. 2020. “Clearing the Transcription Hurdle in Dialect Corpus Building: The Corpus of Southern Dutch Dialects as Case-Study.” Frontiers in Artificial Intelligence 3:10. doi: 10.3389/frai.2020.00010.

Van Keymeulen, Jacques, Veronique De Tier, Anne Breitbarth, Anne-Sophie Ghyselen & Melissa Farasyn. 2019. “Het Dialectologische Corpus ‘Stemmen Uit Het Verleden’ van de Universiteit Gent.” Volkskunde 120 (2): 193–204.

Abstract

Annotated corpora of spontaneous speech are a rarity compared to corpora of written language, with parsed dialect corpora being even scarcer. However, these spoken (dialect) corpora are indispensable for understanding language structure, variation, and change.

The parsed corpus of Southern Dutch Dialects (GCND) is the first of its kind for spoken dialects of Dutch and offers a unique opportunity to explore language variation. The project aims to make accessible a unique collection of dialect recordings from 768 places in Belgium, the north of France and the south of the Netherlands. The informants are generally non-mobile, rural and uneducated, and were born around 1900. The recordings were made by dialectologists from Ghent University, and 740 of them date from between 1963 and 1976. For the GCND, the Ghent collection is being complemented with new recordings (28 recordings from Brussels, Flemish Brabant and Limburg) and recordings from the Meertens Institute (73 recordings from the south of the Netherlands).

The GCND distinguishes itself by focusing exclusively on spontaneous speech. All recordings are transcribed using a newly developed two-tier protocol. They are also linguistically annotated using the Alpino parser, which was originally developed for Standard Dutch: each word is tagged for its word class (‘part-of-speech tagging’), and the syntactic functions of word groups and the relationships between them are analysed (‘parsing’). In addition to this (semi-)automatic annotation, substantial pre-processing and post-processing was needed to deal with dialectal peculiarities not found in Standard Dutch. The annotations are complemented by extensive manuals explaining these dialectal syntactic peculiarities.

By May 2023, the corpus, the query tools and the documentation will be made available online. ICLaVE|12 provides the perfect opportunity to present the new corpus and the associated search tools to an interested audience. In this multimedia presentation, we will therefore present the corpus together with the query tools that the Dutch Language Institute (INT) has developed for searching the GCND for word forms and syntactic patterns. The transcriptions are aligned with the audio, so that for each search result the user can consult not only the linguistic annotation but also the corresponding audio clip. Links to an extensive metadata database make it possible to filter on a wide range of relevant parameters (such as age, gender, place of growing up, education and profession).

In this way, the GCND will be an unparalleled resource, promising a deeper understanding of Dutch dialects and their linguistic richness.