I have a very large text file with duplicate entries which I want to eliminate. I do not care about the order of the entries because the file will later be sorted.
Here is what I have so far:
unique_lines = set()
outfile = open("UniqueMasterList.txt", "w", encoding="latin-1")
with open("MasterList.txt", "r", encoding="latin-1") as infile:
    for line in infile:
        if line not in unique_lines:
            outfile.write(line)
            unique_lines.add(line)
outfile.close()
It has been running for 30 minutes and has not finished. I need it to be faster. What is a faster approach in Python?
To use the same technique as uniq, in Python:
import itertools

# Sort the lines, then let groupby collapse runs of identical lines.
with open("MasterList.txt", "r", encoding="latin-1") as infile:
    sorted_lines = sorted(infile)

with open("UniqueMasterList.txt", "w", encoding="latin-1") as outfile:
    for line, _ in itertools.groupby(sorted_lines):
        outfile.write(line)
This presumes that the entire file fits into memory, twice over; or that the file is already sorted, in which case you can skip the sorting step.
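If the input really is already sorted on disk, groupby can consume the file handle directly and the whole process streams with only one group in memory at a time. A minimal sketch (the `dedupe_sorted` helper is my own name, not from the question; the demo uses an in-memory stream so it runs standalone):

```python
import io
import itertools

def dedupe_sorted(infile, outfile):
    # groupby collapses runs of consecutive identical lines, so on a
    # sorted stream each distinct line is written exactly once.
    for line, _group in itertools.groupby(infile):
        outfile.write(line)

# Demonstration on an in-memory stream; with a real file you would pass
# open("MasterList.txt", encoding="latin-1") and an output file instead.
src = io.StringIO("apple\napple\nbanana\ncherry\ncherry\n")
dst = io.StringIO()
dedupe_sorted(src, dst)
print(dst.getvalue())  # one line each: apple, banana, cherry
```

This is the same strategy uniq uses: it only ever compares a line against its immediate predecessor, which is why it requires sorted input.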