Tags: python, nlp, nltk, similarity, corpus

The similar method from the nltk module produces different results on different machines. Why?


I have taught a few introductory classes on text mining with Python, and the class tried the similar method on the provided practice texts. Some students got different results for text1.similar() than others.

All software versions were the same across machines.

Does anyone know why these differences would occur? Thanks.

Code used at the command line:

python
>>> import nltk
>>> nltk.download() #here you use the pop-up window to download texts
>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
>>> text1.similar("monstrous")
mean part maddens doleful gamesome subtly uncommon careful untoward
exasperate loving passing mouldy christian few true mystifying
imperial modifies contemptible
>>> text2.similar("monstrous")
very heartily so exceedingly remarkably as vast a great amazingly
extremely good sweet

The lists of terms returned by the similar method differ from user to user. They have many words in common, but they are not identical. All users were using the same OS and the same versions of Python and NLTK.

I hope that makes the question clearer. Thanks.


Solution

  • In your example there are 40 other words that each share exactly one context with the word 'monstrous'. The similar function uses a Counter object to count the words with shared contexts and then prints the most common ones (20 by default). Since all 40 have the same count, which 20 are printed, and in what order, can differ.

    From the doc of Counter.most_common:

    Elements with equal counts are ordered arbitrarily


    I checked the frequency of the similar words with this code (which is essentially a copy of the relevant part of the function code):

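    # Standard-library Counter; the nltk.compat alias may be missing in newer NLTK releases.
    from collections import Counter

    from nltk.book import *
    from nltk.util import tokenwrap

    word = 'monstrous'
    num = 20

    # Calling similar() first builds the private _word_context_index used below.
    text1.similar(word)

    wci = text1._word_context_index._word_to_contexts

    if word in wci.conditions():
        contexts = set(wci[word])
        # For every other word, count how many contexts it shares with `word`.
        fd = Counter(w for w in wci.conditions() for c in wci[w]
                     if c in contexts and not w == word)
        words = [w for w, _ in fd.most_common(num)]
        # print(tokenwrap(words))

        # Printed inside the if block so fd is always defined here.
        print(fd)
        print(len(fd))
        print(fd.most_common(num))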
    

    Output (it differs between runs for me):

    Counter({'doleful': 1, 'curious': 1, 'delightfully': 1, 'careful': 1, 'uncommon': 1, 'mean': 1, 'perilous': 1, 'fearless': 1, 'imperial': 1, 'christian': 1, 'trustworthy': 1, 'untoward': 1, 'maddens': 1, 'true': 1, 'contemptible': 1, 'subtly': 1, 'wise': 1, 'lamentable': 1, 'tyrannical': 1, 'puzzled': 1, 'vexatious': 1, 'part': 1, 'gamesome': 1, 'determined': 1, 'reliable': 1, 'lazy': 1, 'passing': 1, 'modifies': 1, 'few': 1, 'horrible': 1, 'candid': 1, 'exasperate': 1, 'pitiable': 1, 'abundant': 1, 'mystifying': 1, 'mouldy': 1, 'loving': 1, 'domineering': 1, 'impalpable': 1, 'singular': 1})
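
If the class needs everyone to see the same list, one option is to rank the counts yourself with an explicit tie-break instead of relying on how most_common orders ties. The following is only a minimal sketch of that idea, not part of NLTK: stable_most_common is a hypothetical helper name, and the small Counter is made-up toy data standing in for the fd computed above.

```python
from collections import Counter


def stable_most_common(counter, n):
    """Return the n highest-count (word, count) pairs, ties broken alphabetically.

    Hypothetical helper for illustration; it is not an NLTK function.
    """
    # Sort by descending count first, then by the word itself,
    # so words with equal counts always come out in the same order.
    ranked = sorted(counter.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]


# Toy counts standing in for the shared-context counts (assumed data).
fd = Counter({"mean": 1, "doleful": 1, "very": 2, "careful": 1})

print(stable_most_common(fd, 3))
# [('very', 2), ('careful', 1), ('doleful', 1)]
```

With the real data you would pass the fd from the snippet above and num as n. The alphabetical tie-break is an arbitrary choice; any fixed secondary key would make the output reproducible across machines.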