I'm working with a 13.9 GB CSV file that contains around 16 million rows and 85 columns. There are potentially a few hundred thousand duplicate rows. I ran this code to remove them:
import pandas
concatDf = pandas.read_csv("C:\\OUT\\Concat EPC3.csv", low_memory=False)
nodupl = concatDf.drop_duplicates()
nodupl.to_csv("C:\\OUT\\Concat EPC3- NoDupl.csv", index=False)
However, this runs into a MemoryError. My RAM is 16 GB and can't go any higher. Is there a more efficient way of removing duplicates, perhaps one that works in chunks, without me having to break the CSV file up into smaller files?
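For reference, a minimal sketch of the kind of chunked approach I have in mind, using pandas.read_csv with chunksize and pandas.util.hash_pandas_object (the chunk size and the hashing scheme are assumptions on my part, and because rows are matched on their hash alone, a rare hash collision could in principle drop a row that is not a true duplicate):

import pandas

file_in = "C:\\OUT\\Concat EPC3.csv"
file_out = "C:\\OUT\\Concat EPC3- NoDupl.csv"

seen = set()          # hashes of rows already written to the output
first_chunk = True
for chunk in pandas.read_csv(file_in, chunksize=1_000_000):
    # Hash each row to a 64-bit integer (index excluded from the hash)
    row_hashes = pandas.util.hash_pandas_object(chunk, index=False)
    # Keep rows not seen in earlier chunks and not repeated within this chunk
    keep = ~row_hashes.isin(seen) & ~row_hashes.duplicated()
    seen.update(row_hashes[keep])
    # Write the surviving rows, appending after the first chunk
    chunk[keep].to_csv(file_out, mode="w" if first_chunk else "a",
                       header=first_chunk, index=False)
    first_chunk = False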
Essentially the same idea as zwer, but checking for equality in rows with the same hash (instead of automatically discarding duplicated hashes).
file_in = "C:\\OUT\\Concat EPC3.csv"
file_out = "C:\\OUT\\Concat EPC3- NoDupl.csv"

with open(file_in, 'r') as f_in, open(file_out, 'w') as f_out:
    # Skip header
    next(f_in)
    # First pass: find hashes that occur more than once
    hashes = set()
    hashes_dup = {}
    for row in f_in:
        h = hash(row)
        if h in hashes:
            hashes_dup[h] = set()
        else:
            hashes.add(h)
    del hashes
    # Rewind file
    f_in.seek(0)
    # Copy header
    f_out.write(next(f_in))
    # Second pass: copy non-repeated lines
    for row in f_in:
        h = hash(row)
        if h in hashes_dup:
            dups = hashes_dup[h]
            if row in dups:
                continue
            dups.add(row)
        f_out.write(row)
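The memory footprint stays small because the first pass stores only one integer hash per distinct row, and the second pass keeps the full text only of rows whose hash appeared more than once, so genuine duplicates are confirmed by comparing the actual lines rather than trusting the hash alone.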