Devil in the details: Georeferencing the historical dialect corpora of German
2022-04-12, 14:30–15:00 (Europe/Vienna), Room 5

Georeferencing corpora can facilitate many aspects of linguistic and digital humanities research, from richer querying to new types of research about the representation of places in texts [1,2,3,6]. We report on ongoing work on georeferencing place metadata at the Archive for Spoken German. Our goal is to tie every place datum to a pair of geographic coordinates. To this end, we link our locations to entities for administrative places in the Wikidata knowledge base [7], from which we can subsequently retrieve coordinate data (and much more).
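As a rough illustration of the coordinate retrieval step, the following sketch builds a SPARQL query for Wikidata's "coordinate location" property (P625) and parses the WKT point literal that the Wikidata Query Service returns for it. The function names are hypothetical and not part of the project's tooling; sending the query to the endpoint is left out, and the parser assumes plain decimal coordinates.

```python
import re

def build_coordinate_query(qids):
    """Build a SPARQL query asking for the coordinate location (P625)
    of each given Wikidata entity ID (e.g. "Q123")."""
    values = " ".join(f"wd:{q}" for q in qids)
    return (
        "SELECT ?item ?coord WHERE { "
        f"VALUES ?item {{ {values} }} "
        "?item wdt:P625 ?coord . }"
    )

# WKT point literals list longitude first, then latitude.
_WKT_POINT = re.compile(
    r"^Point\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)$",
    re.IGNORECASE,
)

def parse_wkt_point(literal):
    """Turn a literal such as "Point(12.92 50.83)" into a (lat, lon) pair."""
    match = _WKT_POINT.match(literal.strip())
    if match is None:
        raise ValueError(f"not a WKT point: {literal!r}")
    lon, lat = float(match.group(1)), float(match.group(2))
    return lat, lon
```

Note the swap in `parse_wkt_point`: the literal stores longitude before latitude, whereas most mapping code expects latitude first.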

Our data comprise 13 variation corpora, including large historical German dialect corpora such as "Deutsche Mundarten (Zwirnerkorpus)" [8]. These corpora are available to the scientific public through the Database of Spoken German [4,5]. The variation corpora amount to 8829 recordings (2352 hours). The oldest of the 10215 different speakers was born in 1861 and the earliest recording dates from 1955. Central and Eastern European locations predominate, but the corpora also cover varieties in the Americas, Africa and Asia. The place metadata concern the recording event and the biographies of the speakers and their associates (e.g. places of birth, schooling, etc.). Altogether, the corpora contain 86091 instances of place metadata (12975 distinct locations). At this scale, georeferencing needs to be automated as much as possible.

We employ a semi-automatic, iterative approach. Using the OpenRefine tool for data cleanup and transformation, we reconcile territorial units to Wikidata entities, proceeding from higher levels (e.g. country) to lower ones (e.g. province) and using the previously reconciled information as a feature for disambiguation at the next lower level. While automatic linking mostly succeeds based on the place's name and its embedding in higher-level administrative units, a significant number of items require human intervention. We can sort these problematic cases into several categories. First, we face problems of data capture and transmission. For instance, place names may have been misspelled at the time of recording. Second, toponyms are subject to change for various complex social and political reasons (e.g. Chemnitz/Karl-Marx-Stadt). Third, for locations outside the contiguous German-speaking areas of Central Europe, names often exist in multiple languages, which we need to keep track of. Fourth, administrative divisions change as a result of war or administrative reform. Some place metadata are thus anachronistic, whereas Wikidata (and most geographic databases) represent present-day geography only.
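The top-down idea can be sketched in a few lines. This is not the OpenRefine workflow itself but a minimal stand-in under simplifying assumptions: candidates are held in memory, names are matched exactly (ignoring case), and each candidate records only its immediate parent unit. All class names, function names and entity IDs are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    entity_id: str   # placeholder ID, not a real Wikidata QID
    name: str
    parent_id: str   # ID of the next higher administrative unit

def reconcile(name, parent_id, candidates):
    """Return the one candidate with this name inside the given parent unit,
    or None if there is no match or the match is ambiguous."""
    matches = [
        c for c in candidates
        if c.name.casefold() == name.casefold() and c.parent_id == parent_id
    ]
    return matches[0] if len(matches) == 1 else None

def reconcile_path(path, root_id, candidates):
    """Reconcile names ordered from highest to lowest level.
    Each resolved unit becomes the parent constraint for the next level.
    Stops at the first level that cannot be resolved automatically."""
    resolved = []
    parent = root_id
    for name in path:
        hit = reconcile(name, parent, candidates)
        if hit is None:
            break  # left for manual post-editing
        resolved.append(hit.entity_id)
        parent = hit.entity_id
    return resolved

# Toy data: the same town name occurs in two provinces.
CANDIDATES = [
    Candidate("C1", "Country", "ROOT"),
    Candidate("P1", "North", "C1"),
    Candidate("P2", "South", "C1"),
    Candidate("T1", "Neustadt", "P1"),
    Candidate("T2", "Neustadt", "P2"),
]
```

With this toy data, `reconcile_path(["Country", "North", "Neustadt"], "ROOT", CANDIDATES)` resolves all three levels because the province narrows the two towns down to one. A misspelled or historical name would stop the loop early, which corresponds to the cases that need human intervention.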

We address the problematic cases in a manual post-editing step. Proceeding by country, then province/region, then district, human editors consult historical maps, gazetteers and additional geographic databases, but also interpret the relationship between places to disambiguate locations. For instance, if the place at issue is a particular speaker's birthplace and all other associated places, such as schools and workplaces, lie close to each other, then out of several candidates the true birthplace is probably the one closest to those other places. We discuss the challenges and the effort required for reconciliation using German and Czech places in two case studies.
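The proximity reasoning in the birthplace example could be approximated as follows. This is only a sketch of the heuristic, not the editors' actual procedure: it assumes that coordinates are already known for the candidates and for the speaker's other places, uses the haversine great-circle distance with a mean Earth radius of 6371 km, and ranks candidates by mean distance. The function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

def closest_candidate(candidates, anchors):
    """Pick the candidate ID whose mean distance to the anchor places
    (e.g. a speaker's school and workplaces) is smallest.

    candidates: dict mapping candidate ID -> (lat, lon)
    anchors:    list of (lat, lon) pairs for already resolved places
    """
    if not candidates or not anchors:
        raise ValueError("need at least one candidate and one anchor")

    def mean_distance(item):
        _, coord = item
        return sum(haversine_km(coord, p) for p in anchors) / len(anchors)

    return min(candidates.items(), key=mean_distance)[0]
```

Such a ranking can only suggest the most plausible candidate. Speakers who moved far from their birthplace would mislead it, so the result still calls for an editor's judgement.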


[1] Noah Bubenhofer. Geokollokationen - Diskurse zu Orten: Visuelle Korpusanalyse. Mitteilungen des Deutschen Germanistenverbandes, 61(1):45–59, 2014.
[2] Ian Gregory, David Cooper, Andrew Hardie, and Paul Rayson. Spatializing and analyzing digital texts: Corpora, GIS, and Places, pages 150–178. Indiana University Press, 2015.
[3] Ian Gregory, Christopher Donaldson, Patricia Murrieta-Flores, and Paul Rayson. Geoparsing, GIS, and textual analysis: Current developments in spatial humanities research. March 2015.
[4] Thomas Schmidt, Joachim Gasch, and Julia Kaiser. DGD - die Datenbank für Gesprochenes Deutsch. Neues vom heutigen Deutsch. Empirisch - methodisch - theoretisch, pages 351–354. de Gruyter, Berlin [u.a.], 2019.
[5] Ulf-Michael Stift and Thomas Schmidt. Mündliche Korpora am IDS: vom deutschen Spracharchiv zur Datenbank für gesprochenes Deutsch. Ansichten und Einsichten. 50 Jahre Institut für Deutsche Sprache, pages 360–375. Institut für Deutsche Sprache, Mannheim, 2014.
[6] Richard Tobin, Claire Grover, Kate Byrne, James Reid, and Jo Walsh. Evaluation of georeferencing. In Proceedings of the 6th Workshop on Geographic Information Retrieval, GIR ’10, New York, NY, USA, 2010. Association for Computing Machinery.
[7] Denny Vrandečić and Markus Krötzsch. Wikidata: A free collaborative knowledgebase. Commun. ACM, 57(10):78–85, September 2014.
[8] Eberhard Zwirner and Wolfgang Bethge. Erläuterungen zu den Texten, volume 1 of Lautbibliothek der deutschen Mundarten. Vandenhoeck & Ruprecht, Göttingen, 1958.
