nlp corpus linguistics

Relationship between vocab size and complexity


I have two corpora. If one has a larger vocabulary size than the other, does that mean its language is more complex?

Apart from the complexity of the language, what else can affect the size of the vocabulary in a corpus?


Solution

  • No. Language consists of a lot more than just vocabulary. If the grammatical structures are convoluted, then even a smaller vocabulary can lead to very complex sentences.

    To answer the second part properly, you'd first need to define what exactly you mean by 'complexity'. Unlike, e.g., sentence length, it is not a measure that can easily be quantified.

    Most reading comprehension measures combine the length of words and sentences, on the assumption that longer words and longer sentences are harder to understand. However, shorter words tend to have more distinct meanings, and are arguably harder to understand if their meaning is not clear from the context.
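    As one concrete illustration of such a measure, here is a minimal sketch of the Automated Readability Index, which combines characters per word with words per sentence. The regex-based word and sentence splitting is a simplifying assumption of this sketch, not part of the index itself, and the function name is made up for the example.

```python
import re


def automated_readability_index(text: str) -> float:
    """Automated Readability Index: higher values suggest harder text."""
    # Naive sentence split on ., ! and ? (an assumption; abbreviations will confuse it).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Words are runs of letters/digits, optionally joined by apostrophes ("don't").
    words = re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z0-9]+)*", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    # Only letters and digits count as characters.
    chars = sum(ch.isalnum() for word in words for ch in word)
    return 4.71 * chars / len(words) + 0.5 * len(words) / len(sentences) - 21.43


simple = "The cat sat. The dog ran."
dense = "Multidimensional considerations necessitate comprehensive investigation."
print(automated_readability_index(simple))
print(automated_readability_index(dense))
```

    Note that a score like this says nothing about how ambiguous the short words are, which is exactly the limitation described above.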

    Update after clarification: The size of the vocabulary depends on various factors, such as:

    1. active vocabulary of the author: if I write a text in my native language (where my vocab is large), the number of different words I use in it will be larger. If I write in a foreign language where I don't know that many words, it will of course be smaller.
    2. the language itself: a bit of an anomaly, but English has a much larger vocabulary than some other languages, due to its history. There are many near-synonyms, so it's easier to use a greater variety of words. Other languages are more limited.
    3. topic: this is probably the biggest factor, as a very limited, technical topic will result in a more limited vocab. Wikipedia in general uses a broad range of words, but if you only take the articles on animals, the vocab will be more restricted.
    4. style: similar to (1), I have an influence on the vocab size by how I write. By limiting my vocab, I can make a text more 'plain' (and leave more to the reader's imagination).