
Filter a text file for similar lines


In a text file with many lines, I need to extract all lines that start with similar words and are not unique. I am looking for lines that begin the same way: they may have identical content (duplicate lines) or slightly different content after the first word. I hope an example makes this clear. Here is a sample from such a file:

hungarian-american
hungarian-german lied
hungarian-german
hungarian-speaking areas
hungarian-speaking regions
hungarica
hungary
hungary and slovakia
hungary and slovakia
hungry i
hunnis, william
hunt, l.

I’m looking for these lines:

hungarian-american
hungarian-german lied
hungarian-german
hungarian-speaking areas
hungarian-speaking regions
hungary
hungary and slovakia
hungary and slovakia

Discarded in this example are

hungarica
hungry i
hunnis, william
hunt, l.

because they are unique (they don’t start with similar words).

How could I tackle this problem? I’m somewhat familiar with Python and regular expressions, but perhaps there’s a much simpler solution? Thanks for your help!


Solution

  • This should do the trick:

    import re
    from collections import defaultdict
    
    dic = defaultdict(list)
    
    lines = """hungarian-american
    hungarian-german lied
    hungarian-german
    hungarian-speaking areas
    hungarian-speaking regions
    hungarica
    hungary
    hungary and slovakia
    hungary and slovakia
    hungry i
    hunnis, william
    hunt, l.""".split('\n')
    
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines so they are not grouped together
        # Preferably use a word tokenizer such as the ones available in NLTK,
        # but splitting on commas, semicolons, hyphens and whitespace gives the idea.
        # re.split always returns at least one element, so [0] is safe.
        first_word = re.split(r',|;|-|\s', line)[0]
        # Group lines that share the same first word
        dic[first_word].append(line)
    
    # Show only the groups that contain more than one line
    for word, lst in dic.items():
        if len(lst) > 1:
            print('\n'.join(lst))
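
The same grouping idea can be wrapped in a function so that it accepts any iterable of lines, including an open file object. This is only a minimal sketch under a few assumptions: the name `similar_lines` and the `separators` parameter are made up for illustration, the "first word" is taken to end at the first comma, semicolon, hyphen or whitespace character (as in the snippet above), and the output order relies on dictionaries preserving insertion order (Python 3.7+).

```python
import re
from collections import defaultdict


def similar_lines(lines, separators=r"[,;\s-]"):
    """Return the lines whose first word is shared with at least one other line.

    `lines` can be any iterable of strings (a list, or an open text file).
    `separators` is a hypothetical parameter: a regex matching the characters
    that end the first word.
    """
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        # Only the first split matters, so stop after one separator.
        first_word = re.split(separators, line, maxsplit=1)[0]
        groups[first_word].append(line)
    # Keep groups with more than one member, in order of first appearance.
    return [line for group in groups.values() if len(group) > 1 for line in group]


sample = [
    "hungarian-american",
    "hungarian-german lied",
    "hungarian-german",
    "hungarian-speaking areas",
    "hungarian-speaking regions",
    "hungarica",
    "hungary",
    "hungary and slovakia",
    "hungary and slovakia",
    "hungry i",
    "hunnis, william",
    "hunt, l.",
]

result = similar_lines(sample)
print("\n".join(result))
```

To run it on a real file, the function could be called as `similar_lines(f)` inside a `with open(...) as f:` block, since iterating over a text file yields its lines. Exact duplicates are kept because they share a first word; whether that is the right notion of "similar" depends on the data, and a proper tokenizer may suit messier input better.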