The Boarnsterhim Corpus 2.0


References

Feitsma, A., Van der Geest, E., Van der Kuip, F. & Meekma, I. (1987). Variations and development in Frisian sandhi phenomena. International Journal of the Sociology of Language, 64, 81–94. https://doi.org/10.1515/ijsl.1987.64.81

Sloos, M., Heeringa, W. & Drenth, E. (2018). The Boarnsterhim Corpus: A bilingual Frisian-Dutch panel and trend study. In Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC 2018) (pp. 2174–2181).

Van der Kuip, F. (1986). Syllabisearring yn it Frysk en it Hollânsk fan fryskpraters. Tydskrift foar Fryske Taalkunde, 2(3), 69–92.

Abstract

This poster has a two-fold aim: (1) to introduce the Boarnsterhim Corpus, a unique apparent-time and real-time corpus of the speech of Dutch-Frisian bilinguals; and (2) to discuss and solve the problems that arise when existing corpora are published under present-day scientific, ethical and privacy guidelines.

The Boarnsterhim Corpus (BHC) consists of data recorded in 1982-1984 (BHC1; see Van der Kuip (1986) and Feitsma et al. (1987)) and in a replication study 35 years later, in 2017-2019 (BHC2; see Sloos et al. (2018)), in the municipality of Boarnsterhim in Friesland (the Netherlands). The apparent-time design of BHC1 was implemented by including three generations of the same family, along a paternal (grandfather, son, grandson) or a maternal (grandmother, daughter, granddaughter) line. The replication study BHC2 used the same design, and some of the BHC1 participants were interviewed again, yielding panel data. In some cases, it was even possible to add a fourth generation within a family. Given the birth cohort of the first generation in BHC1 (1897-1920), we were not able to re-record these speakers. The BHC comprises 112 bilingual speakers, born between 1897 and 2001. It contains about 80 hours of Frisian and 80 hours of Dutch speech. In every interview session, both spontaneous speech (collected by means of semi-structured interviews) and reading-style data were collected.

The BHC, which covers the two closely related languages Frisian and Dutch, is an excellent source for studying language variation and change in the two languages of bilinguals, the role of language transfer in bilinguals, the role of literacy (most speakers are less literate in Frisian than in Dutch), and whether and how variation patterns are transferred from one language to the other.

The second part of the poster will focus on issues with the release of the BHC and, more generally, of existing speech corpora. The BHC was put online in 2019 by the PI responsible for BHC2, so that other researchers would be able to use the corpus. However, it was decided to take the corpus offline and start a revision of the data set in order to correct numerous errors and inconsistencies in the orthographic transcriptions, to correct metadata, to remove speakers that should not have been included in the corpus, and to bring the corpus in line with current scientific, ethical and privacy guidelines. In our poster we will give insight into the procedures we followed, and formulate some guidelines, tips and tricks for other researchers.