Devil in the details: Georeferencing the historical dialect corpora of German
2022-04-12, 14:30–15:00 (Europe/Vienna), Room 5

Georeferencing corpora can facilitate many aspects of linguistic and digital humanities research, from richer querying to new types of research on the representation of places in texts [1,2,3,6]. We report on ongoing work on georeferencing place metadata at the Archive for Spoken German. Our goal is to tie every place datum to a pair of geographic coordinates. To this end, we link our locations to entities for administrative places in the Wikidata knowledge base [7], from which we can subsequently retrieve coordinate data (and much more).
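As a rough illustration of the last step, the sketch below reads a coordinate pair from a Wikidata entity record in its JSON form, where coordinates are stored under the property P625 ("coordinate location"). The function name `extract_coordinates` and the sample record are assumptions made for this example; the record is hand-written with illustrative values and is not taken from our data or retrieved from Wikidata.

```python
from typing import Optional, Tuple


def extract_coordinates(entity: dict) -> Optional[Tuple[float, float]]:
    """Return (latitude, longitude) from the first usable P625 claim, or None."""
    for claim in entity.get("claims", {}).get("P625", []):
        # Claims marked "novalue"/"somevalue" carry no datavalue; skip them.
        value = claim.get("mainsnak", {}).get("datavalue", {}).get("value")
        if value is not None:
            return (value["latitude"], value["longitude"])
    return None


# Hand-written stand-in for an entity record. The id is a placeholder and
# the coordinate values are illustrative only (assumption for this sketch).
sample_entity = {
    "id": "Q_EXAMPLE",
    "claims": {
        "P625": [
            {
                "mainsnak": {
                    "snaktype": "value",
                    "property": "P625",
                    "datavalue": {
                        "type": "globecoordinate",
                        "value": {"latitude": 50.83, "longitude": 12.92},
                    },
                }
            }
        ]
    },
}

coords = extract_coordinates(sample_entity)
no_coords = extract_coordinates({"id": "Q_EMPTY", "claims": {}})
```

Returning `None` rather than raising keeps entities without coordinates visible, so they can be queued for manual treatment.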

Our data comprise 13 variation corpora, including large historical German dialect corpora such as "Deutsche Mundarten (Zwirnerkorpus)" [8]. These corpora are available to the scientific public through the Database of Spoken German [4,5]. The variation corpora amount to 8829 recordings (2352 hours). The oldest of the 10215 different speakers was born in 1861, and the earliest recording dates from 1955. Central and Eastern European locations predominate, but the corpora also cover varieties in the Americas, Africa and Asia. The place metadata concern the recording event and the biographies of the speakers and their associates (e.g. places of birth, schooling, etc.). Altogether, the corpora contain 86091 instances of place metadata (12975 distinct locations). At this scale, georeferencing needs to be automated as far as possible.

We employ a semi-automatic, iterative approach. Using the OpenRefine tool for data cleanup and transformation, we reconcile territorial units to Wikidata entities, proceeding from higher levels (e.g. country) to lower ones (e.g. province) and using the previously reconciled information as a feature for disambiguation at the next lower level. While automatic linking mostly succeeds based on the place's name and its embedding in higher-level administrative units, a significant number of items require human intervention. We can sort these problematic cases into several categories. First, we face problems of data capture and transmission. For instance, place names may have been misspelled at the time of recording. Second, toponyms are subject to change for various complex social and political reasons (e.g. Chemnitz/Karl-Marx-Stadt). Third, for locations outside the contiguous German-speaking areas of Central Europe, names often exist in multiple languages, which we need to keep track of. Fourth, administrative divisions change as a result of war or administrative reform. Some place metadata are thus anachronistic, whereas Wikidata (and most geographic databases) represent present-day geography only.
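The top-down disambiguation can be pictured with a small sketch. It is not OpenRefine's reconciliation service, only a plain-Python illustration of the idea that a name is linked automatically when exactly one candidate lies inside the already reconciled higher-level unit, and is otherwise left for a human editor. The `Candidate` class, the `reconcile` function and all identifiers in the sample data are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Candidate:
    entity_id: str            # placeholder id, not a real Wikidata Q-id
    name: str
    parent_id: Optional[str]  # next higher administrative unit, if any


def is_within(candidate: Candidate, ancestor_id: str,
              by_id: Dict[str, Candidate]) -> bool:
    """Walk up the parent chain and report whether ancestor_id occurs in it."""
    seen = set()
    current = candidate.parent_id
    while current is not None and current not in seen:
        if current == ancestor_id:
            return True
        seen.add(current)  # guards against cycles in the hierarchy
        parent = by_id.get(current)
        current = parent.parent_id if parent else None
    return False


def reconcile(name: str, reconciled_parent_id: str,
              candidates: List[Candidate]) -> Optional[str]:
    """Return the id of the single matching candidate, else None (manual review)."""
    by_id = {c.entity_id: c for c in candidates}
    matches = [
        c for c in candidates
        if c.name.casefold() == name.casefold()
        and is_within(c, reconciled_parent_id, by_id)
    ]
    return matches[0].entity_id if len(matches) == 1 else None


# Invented hierarchy: one country, two provinces, two places of the same name.
candidates = [
    Candidate("C1", "Country", None),
    Candidate("P1", "Province A", "C1"),
    Candidate("P2", "Province B", "C1"),
    Candidate("N1", "Neustadt", "P1"),
    Candidate("N2", "Neustadt", "P2"),
]

linked = reconcile("Neustadt", "P1", candidates)     # province already known
ambiguous = reconcile("Neustadt", "C1", candidates)  # only the country known
```

In this toy setup the province-level context resolves the name, whereas the country-level context alone leaves two candidates and therefore returns `None`.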

We address the problematic cases in a manual post-editing step. Proceeding by country, province/region and district, human editors consult additional sources such as historical maps, gazetteers and geographic databases, but also interpret the relationships between places to disambiguate locations. For instance, if the place at issue is a particular speaker's birthplace and all other places associated with that speaker, such as schools and workplaces, lie close to each other, then out of several candidates the true birthplace is probably the one closest to those other places. We discuss the challenges and the effort required for reconciliation using German and Czech places in two case studies.
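The proximity reasoning in the birthplace example can be approximated in code. The sketch below is only an aid to illustrate the heuristic, not a description of the editors' procedure: it ranks candidates by their mean great-circle (haversine) distance to a speaker's other, already georeferenced places. The function names, the mean-distance criterion and all coordinates are assumptions chosen for this example.

```python
import math
from typing import Dict, List, Tuple

Coord = Tuple[float, float]  # (latitude, longitude) in decimal degrees

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(a: Coord, b: Coord) -> float:
    """Great-circle distance between two points in kilometres."""
    lat1, lon1 = math.radians(a[0]), math.radians(a[1])
    lat2, lon2 = math.radians(b[0]), math.radians(b[1])
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def closest_candidate(candidates: Dict[str, Coord],
                      known_places: List[Coord]) -> str:
    """Return the candidate id with the smallest mean distance to known_places."""
    def mean_distance(coord: Coord) -> float:
        return sum(haversine_km(coord, p) for p in known_places) / len(known_places)

    return min(candidates, key=lambda cid: mean_distance(candidates[cid]))


# Invented example: two same-named candidates and two known places
# (e.g. school and workplace) of one speaker.
candidates = {"cand_a": (50.0, 12.0), "cand_b": (53.0, 9.0)}
known_places = [(50.1, 12.1), (50.2, 11.9)]

best = closest_candidate(candidates, known_places)
```

A ranking of this kind would at most order the candidates for an editor; cases in which the known places are themselves far apart would still call for human judgement.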


