EDAudio: Easy Data Augmentation Techniques for Audio Classification


Panel Affiliation

Embracing Variability in Natural Language Processing

References
  • Wei, Jason, and Kai Zou. "EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks." Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019.
  • Ferreira-Paiva, Lucas, et al. "A Survey of Data Augmentation for Audio Classification." (2022).
  • Schmidt, Jürgen Erich, et al. "Regionalsprache.de (REDE). Forschungsplattform zu den modernen Regionalsprachen des Deutschen." Research Center Deutscher Sprachatlas, Marburg (2008ff.).
  • Mann, Henry B., and Donald R. Whitney. "On a test of whether one of two random variables is stochastically larger than the other." The Annals of Mathematical Statistics (1947): 50-60.
Abstract

This project aims to develop a reliable classifier for German dialects using data augmentation methods. To this end, we propose EDAudio, a set of techniques for augmenting the training data for audio classification. Inspired by Wei & Zou (2019), EDAudio performs random insertion, deletion, substitution, and swapping of audio segments taken from the training data. It also introduces additional hyperparameters, such as the duration of these random operations, the percentage of each audio's length to augment, and the number of augmented audio files produced by the augmentation process. We carry out an ablation study of these hyperparameters for each individual operation in isolation and, where applicable, conduct a comparative analysis with similar methods.
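The four random operations can be illustrated on raw waveforms. Below is a minimal Python sketch, assuming audio is held as NumPy sample arrays; the function names and the way segments are chosen are illustrative, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_deletion(audio, seg_len):
    """Remove one randomly chosen segment of seg_len samples."""
    start = rng.integers(0, len(audio) - seg_len)
    return np.concatenate([audio[:start], audio[start + seg_len:]])

def random_insertion(audio, segment):
    """Insert a segment (e.g. drawn from another training audio) at a random position."""
    pos = rng.integers(0, len(audio))
    return np.concatenate([audio[:pos], segment, audio[pos:]])

def random_substitution(audio, segment):
    """Overwrite a randomly placed span with a segment of equal length."""
    out = audio.copy()
    start = rng.integers(0, len(audio) - len(segment))
    out[start:start + len(segment)] = segment
    return out

def random_swap(audio, seg_len):
    """Exchange two non-overlapping random segments of seg_len samples."""
    out = audio.copy()
    a = rng.integers(0, len(audio) // 2 - seg_len)            # segment in first half
    b = rng.integers(len(audio) // 2, len(audio) - seg_len)   # segment in second half
    out[a:a + seg_len] = audio[b:b + seg_len]
    out[b:b + seg_len] = audio[a:a + seg_len]
    return out
```

In practice these operations would be applied to the segment durations and audio-length percentages set by the hyperparameters described above.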
To demonstrate the effectiveness of these techniques, we perform experiments on different German dialects and compare the results with well-known audio data augmentation methods as surveyed by Ferreira-Paiva et al. (2022).
The audio recordings used as the dataset are from the REDE corpus (Schmidt et al. 2008ff.), which contains speech recordings of male speakers of different generations from 150 places in Germany. As a first approach, we use speakers of the older generation (65+ years) for the training data. The informants were asked to translate so-called Wenker sentences (https://www.uni-marburg.de/en/fb09/dsa/research-documentation-center/wenkersaetze) into their own dialect. Since the audio recordings include the entire conversational context involving both the informant and the explorer, it is necessary to isolate the informant's part of the audio. For this purpose, we initially perform a manual segmentation of two dialects (21 locations and 35 speakers). Subsequently, we also utilise automatic segmentation, whose accuracy is verified by comparison with the manual segmentation. Once validated, all subsequent tests and analyses are performed using the automatic segmentation.
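One way to verify the automatic segmentation against the manual one is via temporal overlap of the informant segments. The sketch below uses intersection-over-union as an illustrative agreement metric; the abstract does not specify which measure is actually used:

```python
def interval_iou(a, b):
    """Temporal intersection-over-union of two (start, end) segments in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0
```

For example, an automatic segment (5.0, 15.0) matched against a manual segment (0.0, 10.0) overlaps for 5 of 15 covered seconds, giving an IoU of 1/3; averaging such scores over matched segments yields an overall agreement figure.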
To ensure reliable results and robust comparisons, we perform multiple individual runs for each test. In each run, we select all audio files from two randomly chosen speakers for validation and the audio files of two additional speakers for testing. The individual F1-scores from all runs are then used to compute mean and median F1-scores. To assess the significance of the various tests and confirm their implications, we apply the Mann-Whitney U-test (Mann & Whitney, 1947) to all F1-scores obtained from the respective tests.
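The evaluation statistics described above can be computed with SciPy. A brief sketch with hypothetical F1-scores follows; the numbers are placeholders, not results from this work:

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Hypothetical F1-scores collected over repeated runs (placeholders, not real results).
f1_baseline = np.array([0.61, 0.58, 0.63, 0.60, 0.59, 0.62, 0.57, 0.64])
f1_edaudio  = np.array([0.66, 0.69, 0.64, 0.71, 0.68, 0.65, 0.70, 0.67])

# Aggregate scores across runs.
mean_f1, median_f1 = f1_edaudio.mean(), np.median(f1_edaudio)

# One-sided Mann-Whitney U-test: are the augmented F1-scores
# stochastically larger than the baseline scores?
stat, p_value = mannwhitneyu(f1_edaudio, f1_baseline, alternative="greater")
```

Because the per-run F1-scores are unlikely to be normally distributed and the number of runs is small, a rank-based test such as Mann-Whitney U is a natural fit here.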