Tags: python, python-3.x, text, duplicates

Faster way to remove duplicates from a very large text file in Python?


I have a very large text file with duplicate entries which I want to eliminate. I do not care about the order of the entries because the file will later be sorted.

Here is what I have so far:

# Remember every line seen so far; write each line only on first sight.
unique_lines = set()

with open("MasterList.txt", "r", encoding="latin-1") as infile, \
     open("UniqueMasterList.txt", "w", encoding="latin-1") as outfile:
    for line in infile:
        if line not in unique_lines:
            outfile.write(line)
            unique_lines.add(line)

It has been running for 30 minutes and has not finished. What is a faster approach in Python?


Solution

  • To use the same technique as uniq in Python, sort the file and then drop adjacent duplicates:

    import itertools

    # Sort every line, then let groupby collapse runs of duplicates --
    # with no key function, groupby groups consecutive equal lines, so
    # each distinct line is written exactly once.
    with open("MasterList.txt", "r", encoding="latin-1") as infile:
        sorted_lines = sorted(infile)

    with open("UniqueMasterList.txt", "w", encoding="latin-1") as outfile:
        for line, _ in itertools.groupby(sorted_lines):
            outfile.write(line)

    This presumes that the entire file will fit into memory, twice over. Alternatively, if the file is already sorted, you can skip the sorting step and simply stream through it, as in the sketch below.
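
    For that already-sorted case, here is a minimal streaming sketch (the file names are placeholders): it keeps only the previous line in memory and writes a line whenever it differs from the one before it, which is exactly what uniq does.

    prev = None
    with open("SortedMasterList.txt", "r", encoding="latin-1") as infile, \
         open("UniqueMasterList.txt", "w", encoding="latin-1") as outfile:
        for line in infile:
            # Duplicates are adjacent in a sorted file, so comparing
            # against the previous line is enough to drop them.
            if line != prev:
                outfile.write(line)
                prev = line

    Because it holds only one line at a time, this variant uses constant memory regardless of the file's size.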