Selecting representative survey sites from a large-scale dialectological study: A comparison of clustering methods
2022-04-12, 14:30–15:00 (Europe/Vienna), Room 1

Reducing an existing network of survey sites in order to find patterns, obtain representative data or revisit a subset of sites has long been an issue in geospatial analysis. Picking the sites for a dialect survey, often based on data from previous surveys, is a central issue in dialectology. However, it is often done purely qualitatively, i.e. by choosing new survey sites based on experts’ knowledge of the dialect landscape. In other cases, purely spatial techniques have been applied, such as random sampling within a grid or equidistant sampling, which disregard both the linguistic data already recorded and the regional variation present. Dialectometry applies clustering methods to survey sites, usually to analyse areal patterns and patterns in attribute space. Most such analyses use hierarchical clustering, despite the existence of other methods (e.g. partitional or density-based clustering). Prokić and Nerbonne (2008) compare different clustering methods for classification tasks. Although clustering methods have been used successfully for such analyses, they have rarely been applied to the task of survey site selection. In this contribution, we address this gap by comparing multiple clustering methods in a site selection task.

We demonstrate different approaches with a real dialectological aim in mind, using real dialect atlas data. The survey sites of the monumental ‘Sprachatlas der deutschen Schweiz’ (SDS) are subsampled to match the sociolinguistic aims, manpower and financial resources of the project ‘Swiss German Dialects Across Time and Space’ (SDATS). SDATS aims for a large-scale comparison of the contemporary state of Swiss German dialects with that of c. 1950, with an increased number of speakers per location (8 speakers of different social backgrounds instead of 2–3 NORMs). Because of sociodemographic changes in Switzerland, such as increased mobility and (sub)urbanisation, experts assume that dialect levelling (cf. Christen, 1998) has taken place. Given such linguistic changes, the number of survey sites needed for SDATS was reduced from the tight-knit network of 565 SDS sites to the 125 most representative sites, using digitised data (Scherrer & Stoeckle, 2016).

We present best practices for this typical dialectological task and demonstrate the effects of different kinds of methods along with validation approaches. We investigate methods used for spatial analysis in dialectometry and biology, and compare their effects and usability for the problem of site reduction. The methods demonstrated include hierarchical clustering with different algorithms (e.g. UPGMA and Ward’s method), a partitional clustering algorithm (partitioning around medoids), a method from botany (heterogeneity-constrained random resampling; Lengyel et al., 2011) and validation methods such as bootstrapping, noisy clustering and cophenetic correlation coefficients (Nerbonne et al., 2008; Birkenes, 2019).
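To make the medoid-based idea concrete, the following is a minimal sketch of how partitioning around medoids can yield representative sites: each medoid is an actual site that minimises the summed distance to the other sites in its cluster. Everything in it is an assumption for illustration only: the nine site names, the six-item variant strings, the distance measure (share of differing items) and the random initialisation, which replaces PAM's BUILD step. It is not the SDS data or the procedure used for SDATS.

```python
import random

# Hypothetical toy data: each site maps to the variants recorded for six
# survey items. Site names and variants are invented for illustration.
sites = {
    "A1": "aaaaaa", "A2": "xaaaaa", "A3": "axaaaa",
    "B1": "bbbbbb", "B2": "ybbbbb", "B3": "bybbbb",
    "C1": "cccccc", "C2": "zccccc", "C3": "czcccc",
}
names = list(sites)


def site_distance(u, v):
    """Share of survey items on which two sites differ (0 = identical)."""
    return sum(a != b for a, b in zip(u, v)) / len(u)


# Full pairwise linguistic distance matrix, indexed like `names`.
dist = [[site_distance(sites[p], sites[q]) for q in names] for p in names]


def total_cost(dist, medoids):
    """Sum of every site's distance to its nearest medoid."""
    return sum(min(row[m] for m in medoids) for row in dist)


def pam_swap(dist, k, seed=0):
    """Swap phase of partitioning around medoids.

    Starts from a random set of k medoids (an assumption; PAM proper
    uses a greedy BUILD step) and accepts any medoid/non-medoid swap
    that lowers the total cost, until no swap improves it.
    """
    n = len(dist)
    medoids = random.Random(seed).sample(range(n), k)
    best = total_cost(dist, medoids)
    improved = True
    while improved:
        improved = False
        for pos in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                cost = total_cost(dist, trial)
                if cost < best:
                    medoids, best, improved = trial, cost, True
    return medoids, best


medoids, cost = pam_swap(dist, k=3)
representatives = sorted(names[m] for m in medoids)

# Assign every site to its nearest medoid to inspect the clusters.
clusters = {names[m]: [] for m in medoids}
for i, name in enumerate(names):
    nearest = min(medoids, key=lambda m: dist[i][m])
    clusters[names[nearest]].append(name)

print("Representative sites:", representatives)
print("Total within-cluster distance:", round(cost, 3))
```

Because the medoids are real sites rather than averaged centroids, they can be used directly as candidate survey locations. In practice the distance matrix would come from the digitised atlas data, and the number of clusters would be set by the resources of the follow-up survey.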

As every intended study and its supporting data are different, there is no one-size-fits-all solution to site selection. We therefore hope to give a general overview that helps the linguistic community find representative candidate survey sites.


Birkenes, Magnus B., 2019. North Frisian dialects: A quantitative investigation using a parallel corpus of translations. Us Wurk. Tydskrift foar Frisistyk / Journal of Frisian Studies, 68(3–4), 119–168.

Christen, Helen, 1998. Convergence and divergence in the Swiss German dialects. Folia Linguistica, 32(1–2), 53–68.

Leemann, Adrian, Péter Jeszenszky, Carina Steiner, Melanie Studerus, Jan Messerli, 2020. SDATS Corpus – Swiss German Dialects Across Time and Space. Retrieved from

Lengyel, Attila, Milan Chytrý & Lubomír Tichý, 2011. Heterogeneity-constrained random resampling of phytosociological databases. Journal of Vegetation Science, 22(1), 175–183.

Nerbonne, John, Peter Kleiweg, Wilbert Heeringa & Franz Manni, 2008. Projecting dialect distances to geography: Bootstrap clustering vs. noisy clustering. In C. Preisach, H. Burkhardt, L. Schmidt-Thieme, & R. Decker (Eds.), Data Analysis, Machine Learning and Applications (pp. 647–654). Berlin, Heidelberg: Springer.

Prokić, Jelena & John Nerbonne, 2008. Recognising groups among dialects. International Journal of Humanities and Arts Computing, 1, 153–172.

Scherrer, Yves & Philipp Stoeckle, 2016. A quantitative approach to Swiss German ‒ Dialectometric analyses and comparisons of linguistic levels. Dialectologia et Geolinguistica, 24, 92–125.

SDS = Sprachatlas der deutschen Schweiz, 1962–2003. Rudolf Hotzenköcherle, Robert Schläpfer, Rudolf Trüb & Peter Zinsli (Eds.). Bern (I–VI) / Basel (VII–VIII): Francke.
