Embracing Variability in Natural Language Processing
References
Atkinson, D. (1998). Scientific Discourse in Sociohistorical Context: The Philosophical Transactions of the Royal Society of London, 1675-1975 (1st ed.). Routledge.
Biber, D., & Gray, B. (2016). Grammatical Complexity in Academic English: Linguistic Change in Writing. Cambridge University Press.
Crystal, D. (2002). Language Death. Cambridge University Press.
Degaetano-Ortlieb, S., & Teich, E. (2019). Toward an Optimal Code for Communication: The Case of Scientific English. Corpus Linguistics and Linguistic Theory, 18(1), 175–207.
Ellis, N. C., & Larsen-Freeman, D. (2009). Language as a Complex Adaptive System (Vol. 11). John Wiley & Sons.
Fischer, S., Knappen, J., Menzel, K., & Teich, E. (2020). The Royal Society Corpus 6.0: Providing 300+ Years of Scientific Writing for Humanistic Study. In Proceedings of the Twelfth Language Resources and Evaluation Conference, Marseille, France, pages 794–802. European Language Resources Association.
Halliday, M. A. K., McIntosh, A., & Strevens, P. (1964). The Linguistic Sciences and Language Teaching. Longmans.
Halliday, M. A. K. (1993). On the Language of Physical Science. Chapter 3 in Halliday, M. A. K., & Martin, J. R. (1993). Writing Science: Literacy and Discursive Power (1st ed.). Routledge.
Haspelmath, M. (1999). Why is Grammaticalization Irreversible? Linguistics, 37(6), 1043–1068. De Gruyter.
Hawkins, J. A. (2004). Efficiency and Complexity in Grammars. Oxford University Press.
Levshina, N. (2022). Communicative Efficiency: Language Structure and Use. Cambridge University Press.
Matthiessen, C. M. I. M. (2014). Registerial Cartography: Context-based Mapping of Text Types and Their Rhetorical-relational Organization. In Proceedings of the 28th Pacific Asia Conference on Language, Information and Computation, Phuket, Thailand, pages 5–26.
Reid, T. B. W. (1956). Linguistics, Structuralism and Philology. Archivum Linguisticum, 8, 28–37.
This work aims to understand more broadly how and why languages vary and evolve over time, drawing on information-theoretic approaches. As a case study, we investigate the evolution of the written English scientific register over approximately 300 years, focusing on its progressive shift towards greater communicative efficiency in expert communication. We understand "register" as variation according to situational context (cf. Reid 1956, Halliday et al. 1964). Analyzing it therefore requires looking across several linguistic levels and taking into account the extra-linguistic factors that define the situational context (cf. Matthiessen 2014).
More generally, languages can be seen as dynamic adaptive systems shaped by many contextual factors, including societal change, technological advances, cultural shifts, and communication needs (cf. Ellis & Larsen-Freeman 2009). As these external conditions change, languages adjust so that they remain effective for communication among their users (e.g. Haspelmath 1999, Crystal 2002, Hawkins 2004, Levshina 2022). We adopt information theory as a framework for systematically analyzing linguistic features over time at several levels (lexical, grammatical, dependency), revealing the adaptive mechanisms by which language responds to changing contextual pressures while maintaining communicative efficiency. Previous observations of scientific English have shown periods of lexical innovation followed by grammatical conventionalization, as well as shifting dynamics between clause and phrase structures (Halliday 1993, Atkinson 1998, Biber & Gray 2016, Degaetano-Ortlieb & Teich 2019). We hypothesize that this progression towards efficiency is one of the main driving forces of change.
To substantiate these claims, we apply empirical methods to the Proceedings and Transactions of the Royal Society as contained in the Royal Society Corpus (RSC; Fischer et al. 2020), which covers 331 years of English scientific writing. We use relative entropy to detect (dis)similarities between language models at the lexical, grammatical, and dependency levels. In addition, we use entropy and surprisal to model changes at the word, sentence, and text levels, uncovering paradigmatic shifts and changes in the predictability of words as adaptations that facilitate expert-to-expert communication.
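The three measures named above can be illustrated with a minimal sketch. It assumes additively smoothed unigram models over a shared vocabulary, which is a simplification chosen here for brevity and not necessarily the modelling used in this work. The two token lists are invented toy examples, not RSC data, and the function names are hypothetical.

```python
import math
from collections import Counter


def unigram_model(tokens, vocab, alpha=1.0):
    """Additively smoothed unigram distribution over a shared vocabulary.

    Smoothing (alpha > 0) keeps every probability non-zero, so the
    relative entropy below is always finite. This is an assumption of
    the sketch, not a detail taken from the abstract.
    """
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def relative_entropy(p, q):
    """Relative entropy D(P || Q) in bits: sum_w P(w) * log2(P(w) / Q(w))."""
    return sum(p[w] * math.log2(p[w] / q[w]) for w in p)


def entropy(p):
    """Shannon entropy H(P) in bits."""
    return -sum(pw * math.log2(pw) for pw in p.values() if pw > 0)


def surprisal(word, p):
    """Surprisal of a word in bits under model P: -log2 P(word)."""
    return -math.log2(p[word])


# Invented toy "periods" standing in for two time slices of a corpus.
early = "i observed that the air did expand when i heated the vessel".split()
late = "thermal expansion of the gas was observed at elevated temperature".split()

vocab = sorted(set(early) | set(late))
p_early = unigram_model(early, vocab)
p_late = unigram_model(late, vocab)

# How many extra bits are needed to encode the later period
# with a model of the earlier period.
kl_late_vs_early = relative_entropy(p_late, p_early)
print(f"D(late || early) = {kl_late_vs_early:.3f} bits")
print(f"H(early) = {entropy(p_early):.3f} bits")
print(f"surprisal('expansion' | early) = {surprisal('expansion', p_early):.3f} bits")
```

Relative entropy is asymmetric, so D(late || early) and D(early || late) generally differ. Comparing a later time slice against an earlier one, as here, is one way to ask how well the older model predicts the newer language use.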
Ultimately, this research contributes to a deeper understanding of language variation and change, which can have valuable implications for NLP. Refining language models to account for language dynamics and variability can improve their interpretability and applicability.