Monday, April 13, 2009

How many words are there in a language?

In a recent discussion, the question came up of whether a language's vocabulary could be tallied (briefly addressed at Language Log a while back, and at FEL). I have no firm answer to that (and it's logically independent of whether or not you can estimate the proportion of the vocabulary coming from a given language - that's a sampling problem). But, notwithstanding the bizarre if occasionally entertaining acrimony of that discussion, it's actually a rather interesting question.

Clearly, any given speaker of a language - and hence any finite set of speakers - can know only a finite number of morphemes, even if you include proper names, nonce borrowings, etc. ("Words" is a different matter - if you choose to define compounds as words, some languages in principle have productive systems defining potentially infinitely many words. The technical vocabulary of chemists in English is one such case, if I recall rightly.) Equally clearly, it's practically impossible to be sure that you've enumerated all the morphemes known by even a single speaker, let alone a whole community; even if you trust (say) the OED to have done that for some subset of English speakers (which you probably shouldn't), you're certainly not likely to find any dictionary that comprehensive for most languages. Does that mean you can't count them?

Not necessarily. You don't always have to enumerate things to estimate how many of them there are, any more than a biologist has to count every single earthworm to come up with an earthworm population estimate. Here's one quick and dirty method off the top of my head (obviously indebted to Mandelbrot's discussion of coastline measurement):
  • Get a nice big corpus representative of the speech community in question. ("Representative" is a difficult problem right there, but let's assume for the sake of argument that it can be done.)
  • Find the lexicon size required to account for the first page, then the first two pages, then the first three, and so on.
  • Graph the lexicon size for the first n pages against n.
  • Find a model that fits the observed distribution.
  • See what the limit of the lexicon size, if any, would be as n tends to infinity, according to this model.
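
Here's roughly what the first three steps might look like in code - a minimal Python sketch only. The file name "corpus.txt", the regex "tokenizer", and the 500-token "page" are all placeholders; and of course counting distinct orthographic words is not the same as counting morphemes, which would need a proper morphological analyser.

    import re

    PAGE_SIZE = 500  # tokens per "page" - an arbitrary assumption

    def vocabulary_growth(path):
        """Return (tokens_seen, distinct_types_seen) pairs, one per "page"."""
        seen = set()
        curve = []
        tokens_seen = 0
        with open(path, encoding="utf-8") as f:
            for line in f:
                for token in re.findall(r"\w+", line.lower()):
                    seen.add(token)
                    tokens_seen += 1
                    if tokens_seen % PAGE_SIZE == 0:
                        curve.append((tokens_seen, len(seen)))
        return curve

    if __name__ == "__main__":
        # "corpus.txt" stands in for whatever representative corpus you have.
        for n, v in vocabulary_growth("corpus.txt"):
            print(n, v)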


A bit of Googling reveals that this rather simplistic idea is not original. On p. 20 of An Introduction to Lexical Statistics, you can see just such a graph. An article behind a pay wall (Fan 2006) has an abstract indicating that for large enough corpora you get a power law.
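
For the curve-fitting step, the standard move is to fit a power law V(n) = k·n^β (essentially what corpus linguists call Heaps' law) by linear regression in log-log space. A sketch, assuming `curve` is the output of the snippet above:

    import numpy as np

    def fit_power_law(curve):
        """Fit V(n) = k * n**beta to (n, V) pairs; returns (k, beta)."""
        ns = np.array([n for n, _ in curve], dtype=float)
        vs = np.array([v for _, v in curve], dtype=float)
        beta, log_k = np.polyfit(np.log(ns), np.log(vs), 1)  # slope, intercept
        return np.exp(log_k), beta

    # k, beta = fit_power_law(curve)
    # predicted_vocab = lambda n: k * n**beta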

But if it's a power law, then (since the power obviously has to be positive) that would predict no limit as n tends to infinity. How can that be, if, for the reasons discussed above, the lexicon of any finite group of speakers must be finite? My first reaction was that that would mean the model must be inapplicable for sufficiently large corpus sizes. But actually, it doesn't imply that necessarily: any finite group of speakers can also only generate a finite corpus. If the lexicon size tends to infinity as the corpus size does, then that just means your model predicts that, if they could talk for infinitely long, your speaker community would eventually make up infinitely many new morphemes - which might in some sense be a true counterfactual, but wouldn't help you estimate what the speakers actually know at any given time. In that case, we're back to the drawing board: you could substitute in a corpus size corresponding to the estimated number of morphemes that all speakers in a given generation would use in their lifetimes, but you're not going to be able to estimate that with much precision.
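
Just to make that last difficulty concrete, here is the sort of back-of-the-envelope extrapolation it would take - every number below is invented for illustration, which is exactly the problem:

    SPEAKERS = 100_000
    TOKENS_PER_SPEAKER_PER_DAY = 15_000  # pure guesswork, not a measurement
    YEARS = 70

    lifetime_tokens = SPEAKERS * TOKENS_PER_SPEAKER_PER_DAY * 365 * YEARS
    # estimated_lexicon = k * lifetime_tokens**beta   # using the fit from above
    print(f"assumed lifetime corpus size: {lifetime_tokens:.2e} tokens")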

The main application for a lexicon size estimate - let's face it - is for language chauvinists to be able to boast about how "ours is bigger than yours". Does this result dash their hopes? Not necessarily! If the vocabulary growth curve for Language A turns out to increase faster with corpus size than the vocabulary growth curve for Language B, then for any large enough comparable pair of samples, the Language A sample will normally have a bigger vocabulary than the Language B one, and speakers of Language A can assuage their insecurities with the knowledge that, in this sense, Language A's vocabulary is larger than Language B's, even if no finite estimate is available for either of them. Of course, the number of morphemes in a language says nothing about its expressive power anyway - a language with a separate morpheme for "not to know", like ancient Egyptian, has a morpheme for which English has no equivalent morpheme, but that doesn't let it express anything English can't - but that's a separate issue.
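
To see why the faster-growing curve eventually wins no matter where it starts, here's a toy comparison with made-up constants: since (k_A/k_B)·n^(β_A − β_B) grows without bound whenever β_A > β_B, the A curve must overtake the B curve at some corpus size.

    k_A, beta_A = 5.0, 0.65   # invented fit for Language A
    k_B, beta_B = 20.0, 0.55  # invented fit for Language B
    for n in (10**4, 10**6, 10**8, 10**10):
        print(n, round(k_A * n**beta_A), round(k_B * n**beta_B))
    # At small n the B sample has more distinct items; by n = 10**8, A has overtaken it.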

OK, that's enough musing for tonight. Over to you, if you like this sort of thing.
