UDE Grime Project: "Issa badman ting, cuh!": Using topic modeling to quantify vernacularity in Grime song lyrics

The burgeoning field of Natural Language Processing (NLP) has spawned a plethora of innovative ways to analyse text. Despite their potential relevance for linguistic purposes, most of these innovations have yet to find their way into the linguist’s methodological toolbox (Taboada 2016). This paper applies topic modeling, an NLP method for detecting semantically coherent topics in unstructured textual data (Liske 2018), to Grime song lyrics. Grime is a “hyper-local”, MC-led musical genre that originated in the streets of East London in the early 2000s (Hancox 2018). Over the last two decades, it has gradually become an integral part of Great Britain’s youth culture.

This contribution forms part of the UDE Grime Project, a broader study on the interrelation of Grime music and Multicultural London English (Cheshire et al. 2011) and Multicultural Urban British English (Drummond 2017). Overall, the project has three distinct aims:
(a) to examine how identity and street credibility are constructed by using varying degrees of vernacularity and to explore the interplay of authenticity, identity and youth-street styles, particularly in the context of a largely DIY music production scheme,
(b) to assess the extent to which the genre’s linguistic repertoire draws on Multicultural London English (Cheshire et al. 2011) and Multicultural Urban British English (Drummond 2017), and
(c) to propose a reproducible analytical framework for exploring MC-led musical genres.

Using data from the UDE Corpus of Grime (UCG), a tidy (Wickham 2014) and fully annotated 3-million-word corpus of over 5,000 songs, I explore whether stylistic shifts towards more vernacular linguistic performance are driven by topic development. The topic models were built using Latent Dirichlet Allocation (LDA), an unsupervised, iterative machine learning algorithm that assumes that each document is a mixture of topics and each topic, in turn, is a mixture of words (Silge & Robinson 2017). The contexts of the fifty most frequent key terms per topic were then assigned a vernacularity index, a metric that assesses the degree of vernacularity by counting the colloquial items surrounding the keyword in question (Thompson 2017). Preliminary findings show that, alongside the highest-ranking topics of "violence/threat of violence" and "territoriality/postcode war", the topics "flirting/relationships" and "sexual activities" are, quite surprisingly, also highly likely to elicit vernacular linguistic behaviour. Furthermore, the results indicate that the variables “topic” and “year” covary significantly: misogynistic context patterns have increased dramatically over the last few years, both throughout the entire sample and across numerous topics.
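The idea behind the vernacularity index can be illustrated with a minimal sketch. It is not the project's implementation: the colloquial word list, the window size, the tokenizer and the normalisation by context length are all assumptions made here for illustration, and the names `COLLOQUIAL_ITEMS`, `tokenize` and `vernacularity_index` are hypothetical.

```python
import re

# Hypothetical inventory of colloquial items; the abstract does not specify one.
COLLOQUIAL_ITEMS = {"cuh", "ting", "badman", "bruv", "mandem", "innit", "wagwan"}


def tokenize(text):
    """Lowercase the text and split it into word tokens (apostrophes kept)."""
    return re.findall(r"[a-z']+", text.lower())


def vernacularity_index(tokens, keyword, window=5, lexicon=COLLOQUIAL_ITEMS):
    """Share of colloquial tokens within +/- `window` tokens of each keyword hit.

    The keyword itself is not counted. Returns None when the keyword does not
    occur or has no surrounding context. Dividing by the context length is an
    assumption; the abstract only mentions counting colloquial items.
    """
    colloquial = 0
    total = 0
    for i, token in enumerate(tokens):
        if token != keyword:
            continue
        # Collect the tokens to the left and right of this keyword occurrence.
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        colloquial += sum(1 for t in context if t in lexicon)
        total += len(context)
    if total == 0:
        return None
    return colloquial / total


# Usage: score the context of the keyword "ting" in the title phrase.
lyric_tokens = tokenize("Issa badman ting, cuh!")
score = vernacularity_index(lyric_tokens, "ting", window=2)
```

In this sketch, a higher score means that a larger share of the words around the keyword belongs to the colloquial inventory. Scores for the key terms of a topic could then be averaged to compare topics, which is again an assumption rather than the procedure reported here.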


Cheshire, Jenny, Paul Kerswill, Sue Fox and Eivind Torgersen. 2011. Contact, the feature pool and the speech community: The emergence of Multicultural London English. Journal of Sociolinguistics, 15(2). 151-196.
Drummond, Rob. 2017. (Mis)interpreting urban youth language: white kids sounding black? Journal of Youth Studies, 20(5). 640-660.
Hancox, Dan. 2018. Inner City Pressure: The Story of Grime. London: William Collins.
Liske, Debbie. 2018. Machine Learning and NLP using R: Topic Modeling and Music Classification. https://www.datacamp.com/community/tutorials/ML-NLP-lyric-analysis (27 October 2020).
Silge, Julia & David Robinson. 2017. Text Mining with R: A tidy approach. Sebastopol, CA: O’Reilly.
Taboada, Maite. 2016. Sentiment Analysis: An Overview from Linguistics. Annual Review of Linguistics, 2(1). 325-347.
Thompson, Charlie. 2017. FitteR HappieR. https://www.rcharlie.com/blog/fitter-happier/ (27 October 2020).
Wickham, Hadley. 2014. Tidy Data. Journal of Statistical Software, 59(10). 1-23.
