My paper with Adam Callahan on the almost power-law behavior of the type-token ratio in literary texts has been submitted for publication and is available here: callahan_davis_heavy_tails_final
In this paper we look in some detail at the behavior of the average number of new words in the first $n$ words of a text, as a function of $n$. Consider replacing each word in a text by a 1 if that word has not appeared previously in the text, and by a 0 if it has appeared previously. The type-token ratio $T(n)$ is the average of the 1’s and 0’s up to $n$. The 1’s and 0’s we get this way are not independent, so the sequence is not equivalent to a sequence of Bernoulli trials. The behavior of $T(n)$ as a function of $n$ is almost described by a power law. Basically, a regression of $\log T(n)$ on $\log n$ gives a straight line with, typically, . This means there are constants $c$ and $\alpha$ such that $T(n) \approx c\,n^{-\alpha}$. When we plot $n^{\alpha} T(n)$ against $n$ we should get, approximately, a constant. However, we do not. Typically we get a plot that, as a function of $n$, increases relatively rapidly to a maximum and then decreases very slowly.
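The construction above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used in the paper: the function names (`new_word_indicators`, `type_token_ratio`, `fit_power_law`, `rescaled`) are hypothetical, words are assumed to be already tokenized and case-normalized, and the power law is assumed to be written as $T(n) \approx c\,n^{-\alpha}$, fitted by ordinary least squares on the log-log scale.

```python
import math

def new_word_indicators(words):
    """Return 1 for each first occurrence of a word and 0 for each repeat."""
    seen = set()
    indicators = []
    for w in words:
        indicators.append(0 if w in seen else 1)
        seen.add(w)
    return indicators

def type_token_ratio(words):
    """Return [T(1), ..., T(N)]: T(n) is the mean of the first n indicators."""
    ratios = []
    types_so_far = 0
    for n, bit in enumerate(new_word_indicators(words), start=1):
        types_so_far += bit
        ratios.append(types_so_far / n)
    return ratios

def fit_power_law(ratios):
    """Least-squares fit of log T(n) = log c - alpha * log n.

    Assumed parametrization (not taken from the paper): T(n) ~ c * n**(-alpha).
    Needs at least two values. Returns (c, alpha).
    """
    xs = [math.log(n) for n in range(1, len(ratios) + 1)]
    ys = [math.log(t) for t in ratios]  # T(n) >= 1/n > 0, so the log is defined
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return math.exp(y_bar - slope * x_bar), -slope

def rescaled(ratios, alpha):
    """Return n**alpha * T(n); this would be constant under an exact power law."""
    return [n ** alpha * t for n, t in enumerate(ratios, start=1)]

# Toy input; a real analysis would use the full word sequence of a text.
words = "the cat sat on the mat and the dog sat on the cat".split()
ttr = type_token_ratio(words)
c, alpha = fit_power_law(ttr)
curve = rescaled(ttr, alpha)
```

Plotting `curve` against $n$ for a full-length text is one way to inspect how far the rescaled type-token ratio departs from a constant.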
A real-valued function of a positive integer variable is slowly varying if as , for some number . (This is a “soft analysis” definition in Terry Tao’s sense, I believe. In the applications we have in mind it would definitely help to make this a “hard analysis” definition.)
Not only is $n^{\alpha} T(n)$ a slowly varying function, where $T(n)$ is the type-token ratio of the first $n$ words and $\alpha$ is the power-law exponent as above, but it also has a variance that is slowly varying and decaying to 0. We call functions that are slowly varying, with a variance that decays to 0 as a power function, ultra-slowly varying. So what we find, for literary texts of a variety of lengths, languages and genres, is that there is an index $\alpha$ (dependent on the text) such that $n^{\alpha} T(n)$ is ultra-slowly varying. If the ultra-slowly varying function were a constant we would have a genuine power law, but it isn’t and we don’t.
We find that, for the literary texts we examined, there is a point in the text, which we call a turn-over point, beyond which the type-token ratio is very well described by a power law (typically, ) and before which it is better described by a function of the form .
In this paper we also consider the relative frequency of occurrence of a word in the first $n$ words of a text as an estimate of the word’s probability of occurrence in the text, and consider the Shannon entropy $H(n)$ of the first $n$ words of text. Typically, $H(n)$ increases logarithmically with $n$.
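As a rough illustration of the entropy calculation, the sketch below estimates each word’s probability by its relative frequency in the first $n$ words and plugs those estimates into Shannon’s formula. The function name `prefix_entropy` is hypothetical, and the choice of base-2 logarithms (entropy in bits) is an assumption; the paper may use a different base.

```python
import math
from collections import Counter

def prefix_entropy(words, n):
    """Shannon entropy, in bits, of the empirical word distribution of words[:n].

    Each word's probability is estimated by its relative frequency in the
    prefix. Base 2 is an assumption made for this sketch.
    """
    prefix = words[:n]
    total = len(prefix)
    if total == 0:
        raise ValueError("need at least one word")
    counts = Counter(prefix)
    return -sum((k / total) * math.log2(k / total) for k in counts.values())

# Toy input; a real analysis would use the full word sequence of a text.
words = "the cat sat on the mat and the dog sat on the cat".split()
entropies = [prefix_entropy(words, n) for n in range(1, len(words) + 1)]
```

Recounting the prefix for every $n$ is quadratic in the text length; for a long text an incremental update of the counts would be the more practical design.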
See also: Physics of Text.