Filter a text file for similar lines

I have a text file with a lot of lines, and I need to extract all lines that start with the same word as at least one other line. These lines might have identical content (duplicate lines) or slightly different content after the first word. Here is a sample from such a file:

hungarian-american
hungarian-german lied
hungarian-german
hungarian-speaking areas
hungarian-speaking regions
hungarica
hungary
hungary and slovakia
hungary and slovakia
hungry i
hunnis, william
hunt, l.

I’m looking for these lines:

hungarian-american
hungarian-german lied
hungarian-german
hungarian-speaking areas
hungarian-speaking regions
hungary
hungary and slovakia
hungary and slovakia

Discarded in this example are

hungarica
hungry i
hunnis, william
hunt, l.

because they are unique (they don’t start off with similar words).

How could I tackle this problem? I’m somewhat familiar with Python and regular expressions, but perhaps there’s a much simpler solution? Thanks for your help!

Solution

This should do the trick:

import re
from collections import defaultdict

dic = defaultdict(list)

lines = """hungarian-american
hungarian-german lied
hungarian-german
hungarian-speaking areas
hungarian-speaking regions
hungarica
hungary
hungary and slovakia
hungary and slovakia
hungry i
hunnis, william
hunt, l.""".split('\n')

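for line in lines:
    # Ideally you would use a word tokenizer such as the ones available in NLTK,
    # but splitting on commas, semicolons, hyphens and whitespace gives the idea.
    first_word = re.split(r'[,;\s-]', line.strip())[0]
    if not first_word:
        # Skip blank lines (re.split never returns an empty list,
        # so there is no IndexError to catch)
        continue
    # Group the lines that share the same first word
    dic[first_word].append(line)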

# Show only the groups that contain more than one line:
for word, lst in dic.items():
    if len(lst) > 1:
        print('\n'.join(lst))
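
Since the question is about a text file rather than a hard-coded string, here is a minimal sketch of the same grouping idea wrapped in functions that accept any iterable of lines, including an open file. The names `similar_lines` and `filter_file` are made up for this illustration, and the sketch assumes Python 3.7 or later, where dictionaries keep insertion order, so the groups come out in the order their first word was first seen.

```python
import re
from collections import defaultdict


def similar_lines(lines):
    """Return the lines whose first word is shared with at least one other line."""
    groups = defaultdict(list)
    for line in lines:
        line = line.rstrip('\n')
        # Same crude tokenizer as above: split on comma, semicolon, hyphen or whitespace
        first_word = re.split(r'[,;\s-]', line.strip())[0]
        if first_word:  # ignore blank lines
            groups[first_word].append(line)
    # Keep only groups with more than one member
    return [line for group in groups.values() if len(group) > 1 for line in group]


def filter_file(path):
    """Hypothetical helper: apply similar_lines to the file at the given path."""
    with open(path, encoding='utf-8') as handle:
        return similar_lines(handle)
```

Calling `print('\n'.join(filter_file('input.txt')))` would then print the non-unique lines, where `input.txt` is only a placeholder file name. Because the whole file is grouped in memory, this works even when the file is not sorted.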