
Find common words between two big unstructured text files


I have two big unstructured text files that cannot fit into memory. I want to find the words common to both.

What would be the most efficient way, in both time and space?

Thanks


Solution

  • I used these two files:

    pi_poem

    Now I will a rhyme construct
    By chosen words the young instruct
    I do not like green eggs and ham
    I do not like them Sam I am
    

    pi_prose

    The thing I like best about pi is the magic it does with circles.
    Even young kids can have fun with the simple integer approximations.
    

    The code is simple. The first loop reads the first file line by line, adding each word to a lexicon set. The second loop reads the second file; each word found in the first file's lexicon goes into a set of common words. Because each file is streamed one line at a time, only the two sets (the distinct words, not the whole files) need to fit in memory.

    Does that do what you need? You'll need to adapt it for punctuation, and you'll probably want to remove the extra printing once you have it changed over.

    lexicon = set()
    with open("pi_poem", "r") as text:
        for line in text:  # iterate the file directly; readlines() would load it all into memory
            for word in line.split():
                lexicon.add(word)  # set.add already ignores duplicates
    print(lexicon)

    common = set()
    with open("pi_prose", "r") as text:
        for line in text:
            for word in line.split():
                if word in lexicon:
                    common.add(word)

    print(common)
    

    Output:

    {'and', 'am', 'instruct', 'ham', 'chosen', 'young', 'construct', 'Now', 'By', 'do', 'them', 'I', 'eggs', 'rhyme', 'words', 'not', 'a', 'like', 'Sam', 'will', 'green', 'the'}
    {'I', 'the', 'like', 'young'}
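
    As for adapting it for punctuation, here is one minimal sketch: lower-case each word and strip surrounding punctuation before it goes into a set. The `words` helper and the inline sample lines are illustrative, not part of the original answer; because the helper accepts any iterable of lines, an open file object streams through it line by line just like the loops above.

    ```python
    import string

    def words(lines):
        """Yield lower-cased words with surrounding punctuation stripped."""
        for line in lines:
            for word in line.split():
                cleaned = word.strip(string.punctuation).lower()
                if cleaned:  # skip tokens that were pure punctuation
                    yield cleaned

    # Small in-memory sample; pass open("pi_poem") instead to stream a file.
    poem = ["I do not like green eggs and ham.\n",
            "I do not like them, Sam I am!\n"]
    prose = ["The thing I like best about pi is the magic.\n"]

    lexicon = set(words(poem))
    common = {w for w in words(prose) if w in lexicon}
    print(common)  # {'i', 'like'} (set order varies)
    ```

    With this normalization, "ham." and "ham" count as the same word, and "I" matches "i".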