Influencers, rappers, politicians, or what's authentic internet register? A study on Polish and Czech YouTube comments
2022-04-12, 15:00–15:30 (Europe/Vienna), Room 5

Every day, YouTube users upload about 720,000 hours of video material. In most cases, the public communication around these videos does not run in one direction only, from the video creator to the audience: registered users can comment on the videos, and they do so publicly under most of them. The communication context is crucial here, since the comments under a video fulfill the referential function in this situation. This gives linguistics a rich source of authentic language material of the digital age and makes YouTube the most dynamic field for observing, describing, and analyzing synchronic change in language varieties.

We observe a special language register on YouTube: the internet register. Following Biber, a register is a "term for any language variety defined by its situational characteristics, including the speaker's purpose, the relationship between speaker and hearer, and the production circumstances" (Biber 2009: 823). While the registers of academic, literary, and spoken language have already been researched relatively well, the internet register remains under-researched. The reasons for this are diverse: it is a very new and rapidly changing field, and finding and researching the relevant content and language varieties requires the skills of digital natives.

The study aims to investigate two key questions for the Polish and Czech internet register:
1. How does the internet register differ from the registers for spoken and written language?
2. Are there any language-dependent and national differences in language usage on YouTube?

A total of around 400,000 tokens of comments were collected for the study using various data mining methods. The videos from which the comments were extracted were not chosen arbitrarily but according to the current official national YouTube trends in Poland and the Czech Republic. In two samples, comments were collected under the most popular videos in ten categories: cars, comedy, fashion & lifestyle, gaming, music, non-political interview, politics, report, sports, and video blog. For each of the two focus languages, the samples were taken from two of the most popular videos in each of the ten categories to ensure the objectivity of the analysis. The data collection was carried out at two different time slots in 2021, so that the corpus ultimately contains comments from 40 videos per language and 80 videos in total.
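The sampling design described above can be checked with a short enumeration. This is a minimal sketch in Python: the category names come from the abstract, while the time-slot labels, rank labels, and variable names are placeholders assumed for illustration, not the study's actual identifiers.

```python
from itertools import product

# Categories as listed in the abstract.
CATEGORIES = [
    "cars", "comedy", "fashion & lifestyle", "gaming", "music",
    "non-political interview", "politics", "report", "sports", "video blog",
]
LANGUAGES = ["Polish", "Czech"]

# Assumed placeholder labels: the abstract only states "two different
# time slots in 2021" and "two of the most popular videos" per category.
TIME_SLOTS = ["slot 1", "slot 2"]
VIDEO_RANKS = [1, 2]

# One entry per sampled video: (language, time slot, category, rank).
design = list(product(LANGUAGES, TIME_SLOTS, CATEGORIES, VIDEO_RANKS))

# Count the sampled videos for each language.
videos_per_language = {
    lang: sum(1 for entry in design if entry[0] == lang) for lang in LANGUAGES
}

print(videos_per_language)  # {'Polish': 40, 'Czech': 40}
print(len(design))          # 80
```

Ten categories, two videos per category, and two time slots yield 40 videos per language, which matches the totals given in the abstract.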

The corpus data were annotated for their grammatical categories, and the POS frequencies, sentence structure, and sentence-length frequencies should reveal more about the internet register. In addition, the comments are stylistically clustered in R to make statistical regularities and dependencies in linguistic usage visible and easier to understand.
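To illustrate the kind of measures mentioned here, the following minimal sketch computes relative POS frequencies and mean sentence length for two tiny samples and compares their POS profiles with cosine distance, a value that a clustering step could work from. It is written in Python rather than R, the comments and their Universal Dependencies-style tags are invented for illustration, and none of the function names come from the study.

```python
from collections import Counter
from math import sqrt

# Invented toy data: each comment is a list of sentences, and each
# sentence is a list of (token, POS tag) pairs. The tags are assumed
# to follow a Universal Dependencies-style tagset.
SAMPLE_A = [
    [[("super", "ADJ"), ("film", "NOUN")]],
    [[("kto", "PRON"), ("to", "PRON"), ("widział", "VERB")], [("lol", "INTJ")]],
]
SAMPLE_B = [
    [[("to", "PRON"), ("je", "AUX"), ("dobrý", "ADJ"), ("rozhovor", "NOUN")]],
]


def pos_profile(comments):
    """Relative frequency of each POS tag over all tokens in the sample."""
    counts = Counter(tag for c in comments for s in c for _, tag in s)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}


def mean_sentence_length(comments):
    """Average number of tokens per sentence."""
    lengths = [len(s) for c in comments for s in c]
    return sum(lengths) / len(lengths)


def cosine_distance(p, q):
    """1 - cosine similarity between two POS profiles (tag -> share)."""
    tags = set(p) | set(q)
    dot = sum(p.get(t, 0.0) * q.get(t, 0.0) for t in tags)
    norm_p = sqrt(sum(v * v for v in p.values()))
    norm_q = sqrt(sum(v * v for v in q.values()))
    return 1.0 - dot / (norm_p * norm_q)


profile_a = pos_profile(SAMPLE_A)
profile_b = pos_profile(SAMPLE_B)
print(profile_a)
print(mean_sentence_length(SAMPLE_A), mean_sentence_length(SAMPLE_B))
print(round(cosine_distance(profile_a, profile_b), 3))
```

With real data, one such profile per video or category would be computed, and the pairwise distances could then be passed to a hierarchical clustering routine.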

The study aims to extend the concept of register, using Polish and Czech as an example, to include language on the internet and to distinguish between authentic and non-authentic internet registers.


Aleksej Tikhonov is a researcher at the Department of Slavonic and Hungarian Studies at the Humboldt University of Berlin. He completed his Ph.D. on the linguistic author identification of Rixdorf manuscripts in summer 2020 and defended his dissertation in spring 2021. His research interests are East and West Slavonic languages, corpus linguistics, digital humanities, (semi-)automatic text recognition, Slavonic languages in German popular music in the 21st century, linguistic integration of minorities, and multilingual language contact in urban areas. Since spring 2021, Aleksej Tikhonov has been working as a postdoc in the UK-German Collaborative Research Project “The History of Pronominal Subjects in the Languages of Northern Europe” between the Humboldt University of Berlin (head: Roland Meyer) and the University of Oxford (head: David Willis).