Heavy tails of distributions of words in literary texts

My paper with Adam Callahan on the almost power-law behavior of the type-token ratio in literary texts has been submitted for publication and is available here: callahan_davis_heavy_tails_final

In this paper we look in some detail at the behavior of the average number of new words in the first n words of a text, as a function of n. Consider replacing each word in a text by a 1 if that word has not appeared previously in the text, and by a 0 if it has. The type-token ratio \rho(n) is the average of these 1's and 0's over the first n words. The sequences of 1's and 0's we get this way are not independent, so they are not equivalent to a sequence of Bernoulli trials. The behavior of \rho(n) as a function of n is almost, but not quite, described by a power law. Basically, a regression of \log(\rho(n)) on \log(n) gives a straight line with, typically, r^2\geq 0.95, which suggests there are constants A and d such that \rho(n) \approx \frac{A}{n^d}. If that were exact, a plot of \rho(n)n^d would be approximately constant. However, it is not: typically we get a plot that, as a function of n, increases relatively rapidly to a maximum and then decreases very slowly.
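As a rough sketch (not the code used in the paper), the indicator construction and the power-law regression described above can be computed as follows; `new_word_indicators`, `type_token_ratio`, and `fit_power_law` are hypothetical helper names, and the word tokenizer is a simplistic stand-in:

```python
import math
import re

def new_word_indicators(text):
    """Map each word to 1 if it has not appeared earlier in the text, else 0."""
    seen = set()
    flags = []
    for word in re.findall(r"[a-z']+", text.lower()):
        flags.append(0 if word in seen else 1)
        seen.add(word)
    return flags

def type_token_ratio(flags):
    """rho(n): running average of the new-word indicators up to position n."""
    rho, total = [], 0
    for n, f in enumerate(flags, start=1):
        total += f
        rho.append(total / n)
    return rho

def fit_power_law(rho):
    """Least-squares regression of log(rho(n)) on log(n).

    Returns (A, d, r2) so that rho(n) ~ A / n**d, with r2 the fit quality.
    """
    xs = [math.log(n) for n in range(1, len(rho) + 1)]
    ys = [math.log(r) for r in rho]
    m = len(xs)
    mx, my = sum(xs) / m, sum(ys) / m
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return math.exp(intercept), -slope, 1 - ss_res / ss_tot
```

On a real text one would then plot \rho(n)n^d against n to see the rise-then-slow-decay shape described above.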

A real-valued function f of a positive integer variable is slowly varying if \frac{f(mn)}{f(m)} \to n^k as m \to \infty, for some number k. (This is a “soft analysis” definition in Terry Tao’s sense, I believe. In the applications we have in mind it would definitely help to make this a “hard analysis” definition.)
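A quick numerical illustration of this definition (my own toy check, not from the paper): for f(n) = \log(n) the ratio \frac{f(mn)}{f(m)} tends to 1 = n^0, so k = 0, while for f(n) = \sqrt{n} the ratio is exactly n^{1/2}, so k = 1/2:

```python
import math

def ratio(f, m, n):
    """f(mn)/f(m), which should approach n**k as m grows."""
    return f(m * n) / f(m)

# f = log: (log m + log 5) / log m -> 1 as m grows, i.e. k = 0.
for m in (10, 10**4, 10**8):
    print(m, ratio(math.log, m, 5))

# f = sqrt: sqrt(4m)/sqrt(m) = 2 = 4**(1/2) for every m, i.e. k = 1/2.
print(ratio(math.sqrt, 100, 4))
```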

Not only is \rho(n) a slowly varying function, but \rho(n)n^d (with d as above) has a variance that is slowly varying and decays to 0. We call functions that are slowly varying, with a variance that decays to 0 like a power of n, ultra-slowly varying. So what we find for literary texts of a variety of lengths, languages, and genres is that there is an index d (depending on the text) such that \rho(n)n^d is ultra-slowly varying. If the ultra-slowly varying function were a constant we would have a genuine power law, but it isn't and we don't.
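One simple way to probe the decaying variance (a sketch under my own assumptions, not the paper's procedure) is to split the sequence g(n) = \rho(n)n^d into consecutive blocks and watch the block variances shrink as one moves deeper into the text:

```python
def block_variances(g, num_blocks=10):
    """Split the sequence g into equal consecutive blocks and return each
    block's (population) variance.

    For an ultra-slowly varying g, these variances should decay roughly
    like a power of the block's position in the text.
    """
    size = len(g) // num_blocks
    out = []
    for b in range(num_blocks):
        block = g[b * size:(b + 1) * size]
        mean = sum(block) / len(block)
        out.append(sum((x - mean) ** 2 for x in block) / len(block))
    return out

# Usage sketch, assuming rho(n) and a fitted exponent d are already in hand:
# g = [r * (n + 1) ** d for n, r in enumerate(rho)]
# print(block_variances(g))
```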

We find that for the literary texts we examined, there is a point in the text, which we call a turn-over point, beyond which the type-token ratio \rho(n) is very well described by a power law (typically, r^2 > 0.99), and before which \rho(n) is better described by a function of the form \frac{\alpha+\beta \log(n)}{n^d}.
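Since \rho(n)n^d = \alpha + \beta \log(n) under the pre-turnover form, \alpha and \beta can be recovered by an ordinary linear regression of \rho(n)n^d on \log(n). A minimal sketch (with a hypothetical helper name, assuming \rho and d are given):

```python
import math

def fit_log_correction(rho, d):
    """Fit rho(n) ~ (alpha + beta * log n) / n**d by linear regression
    of rho(n) * n**d on log(n). Returns (alpha, beta)."""
    xs = [math.log(n) for n in range(1, len(rho) + 1)]
    ys = [r * n ** d for n, r in enumerate(rho, start=1)]
    m = len(xs)
    mx, my = sum(xs) / m, sum(ys) / m
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    beta = sxy / sxx
    alpha = my - beta * mx
    return alpha, beta
```

In practice one would fit this only on the portion of the text before the turn-over point.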

In this paper we also consider the relative frequency of occurrence of a word in the first n words of a text as an estimate of the word’s probability of occurrence in the text, and consider the Shannon entropy H(n) of the first n words of the text. Typically, H(n) increases logarithmically with n.
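With relative frequencies as the probability estimates, H(n) is just the plug-in Shannon entropy of the first n words. A minimal sketch (hypothetical helper name, entropy in bits):

```python
import math
from collections import Counter

def shannon_entropy_prefix(words, n):
    """Shannon entropy (in bits) of the word distribution over the first n
    words, using relative frequencies c/n as probability estimates."""
    counts = Counter(words[:n])
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

For example, a prefix in which two words each occur half the time has entropy exactly 1 bit.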

See also: Physics of Text.
