A remarkable trait of human communication is that universal patterns can be found regardless of the language being used. Zipf's law is the most famous example: when words are ranked according to how frequently they are used, a power-law relationship is observed between the rank of a word and its frequency. Language, however, like any complex system, is not characterized simply by the properties of its individual components, but by the interactions that occur between them. Words, in this case, are the simple building blocks of complex grammatical structures that ultimately become the components of a system of communication fundamental to the functioning of society.
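As a minimal sketch of the rank-frequency relationship described above, the following Python snippet ranks the words of a small synthetic corpus and estimates the power-law exponent from a least-squares fit in log-log space. The corpus (word r appearing roughly 1000/r times), the function names, and the choice of a plain least-squares fit are all assumptions made for illustration; they are not taken from the work summarized here, and a log-log fit is only a rough estimator on real data.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return word counts sorted from most to least frequent (rank 1 first)."""
    return [count for _, count in Counter(tokens).most_common()]


def fit_power_law_exponent(frequencies):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(f) for f in frequencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


# Synthetic corpus (hypothetical): the word of rank r appears about 1000 / r times.
tokens = []
for r in range(1, 51):
    tokens.extend([f"w{r}"] * round(1000 / r))

frequencies = rank_frequency(tokens)
exponent = fit_power_law_exponent(frequencies)
print(f"estimated exponent: {exponent:.2f}")
```

For a corpus built this way the fitted exponent should be close to -1, the value usually associated with Zipf's law.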
In this talk we will consider the dynamics of higher-order linguistic structures and how they relate to their constituent parts. We ask whether these two levels of organization change over time at the same pace, whether the dynamics at one level are reducible to those at a lower level, and what this can tell us about the evolution of language and about cultural change.
Data and Methods. We use the Google Books N-grams dataset, which contains an estimated 4% of all books printed throughout the world until 2009. From these data we measured the yearly frequencies of words (1-grams), pairs of words (2-grams), and so on up to N-grams with N = 5, for six Indo-European languages between 1855 and 2009. Using these data we create a model language that retains the empirical frequencies of individual words but disregards the grammatical structure of language. We use the rank diversity, a previously introduced measure, to compare the rate of change of N-gram usage in the random model with that in the empirical data.
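The two ingredients of this comparison can be sketched in a few lines of Python. The sketch rests on two assumptions that are not spelled out above: the model language is approximated by shuffling the tokens of each yearly text (which preserves word counts and destroys word order), and the rank diversity at rank k is taken to be the number of distinct items that occupy rank k over the observed years, divided by the number of years. The toy texts and all function names are hypothetical.

```python
import random
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def shuffled_model(tokens, seed=0):
    """Null-model text: same words and counts, random order (an assumption)."""
    rng = random.Random(seed)
    shuffled = list(tokens)
    rng.shuffle(shuffled)
    return shuffled


def ranking(counts):
    """Items ordered from most to least frequent; ties broken by item order."""
    return [item for item, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]


def rank_diversity(rankings, k):
    """Distinct items seen at rank k (1-based) divided by the number of rankings.

    Assumes every ranking has at least k items.
    """
    occupants = {r[k - 1] for r in rankings}
    return len(occupants) / len(rankings)


# Toy yearly texts (invented for illustration, not taken from the dataset).
yearly_tokens = [
    "the cat saw the dog and the dog saw the cat".split(),
    "the dog saw the cat and the cat ran".split(),
    "the cat ran and the dog ran and the cat ran".split(),
]

for n in (1, 2):
    empirical = [ranking(ngram_counts(t, n)) for t in yearly_tokens]
    model = [ranking(ngram_counts(shuffled_model(t, seed=i), n))
             for i, t in enumerate(yearly_tokens)]
    print(n, rank_diversity(empirical, 1), rank_diversity(model, 1))
```

Because shuffling leaves the 1-gram counts untouched, the word rankings of the model and of the original texts coincide by construction; any difference between the two can only appear for N greater than 1.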
Results. The dynamics of higher-order linguistic structures cannot be fully described in terms of the dynamics of their component parts: while the ranking of words in the model remains identical to that of the empirical data, higher-order word combinations in the model exhibit greater stability. This is contrary to what is found in the empirical data, in which rank diversity increases with longer word combinations (higher N). Furthermore, we find that most 2-grams, particularly those that are most frequently used, occur more often in the empirical data than in the model. This trend is a consequence of the tendency of particular words to follow particular others, a phenomenon that can be quantified by the entropy of the set of words that may follow any given word, and whose distribution shows a remarkable universality across the six European languages we analyzed.
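One way to make the follower entropy concrete is sketched below. It assumes the Shannon entropy, in bits, of the empirical distribution of words observed immediately after a given word; the exact estimator used in the analysis is not stated above, and the function name and toy sentence are hypothetical.

```python
import math
from collections import Counter, defaultdict


def follower_entropy(tokens):
    """Shannon entropy (bits) of the next-word distribution for each word."""
    followers = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        followers[current][nxt] += 1

    entropies = {}
    for word, counts in followers.items():
        total = sum(counts.values())
        # H = sum over followers of p * log2(1 / p)
        entropies[word] = sum(
            (c / total) * math.log2(total / c) for c in counts.values()
        )
    return entropies


# Toy text: "a" is followed by "b" or "c" equally often; "b" and "c" only by "a".
tokens = "a b a c a b a c".split()
entropies = follower_entropy(tokens)
print(entropies)
```

A word with a single possible follower has zero entropy, while a word followed by many different words with similar probabilities has high entropy. In a text where word order is random, the followers of every word are drawn from the overall word distribution, so low-entropy words are one signature of the structure that the model discards.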