Tags: python, duplicates, large-data

Removing duplicates on very large datasets


I'm working with a 13.9 GB CSV file that contains around 16 million rows and 85 columns. I know there are potentially a few hundred thousand duplicate rows. I ran this code to remove them:

import pandas

concatDf = pandas.read_csv("C:\\OUT\\Concat EPC3.csv", low_memory=False)
nodupl = concatDf.drop_duplicates()
nodupl.to_csv("C:\\OUT\\Concat EPC3- NoDupl.csv", index=False)

However, this runs into a MemoryError. My RAM is 16 GB and can't go any higher. Is there a more memory-efficient way of removing duplicates, perhaps one that works in chunks, without me having to break up the CSV file into smaller files?


Solution

  • Essentially the same idea as zwer's answer, but rows that share a hash are checked for actual equality (instead of automatically discarding every row whose hash has been seen, which would silently drop distinct rows on a hash collision).

    file_in = "C:\\OUT\\Concat EPC3.csv"
    file_out = "C:\\OUT\\Concat EPC3- NoDupl.csv"
    
    with open(file_in, 'r') as f_in, open(file_out, 'w') as f_out:
        # Skip header
        next(f_in)
        # Find duplicated hashes
        hashes = set()
        hashes_dup = {}
        for row in f_in:
            h = hash(row)
            if h in hashes:
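                # Hash already seen: remember it so pass 2 can compare the actual row text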
                hashes_dup[h] = set()
            else:
                hashes.add(h)
        del hashes
        # Rewind file
        f_in.seek(0)
        # Copy header
        f_out.write(next(f_in))
        # Copy non-repeated lines
        for row in f_in:
            h = hash(row)
            if h in hashes_dup:
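                # Hash appeared more than once in pass 1: compare the full row text
                # so a hash collision cannot drop a distinct row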
                dups = hashes_dup[h]
                if row in dups:
                    continue
                dups.add(row)
            f_out.write(row)
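
  • If you would rather stay in pandas, a chunked variant of the same hashing idea is sketched below. This is a minimal sketch, not part of the original answer: it reads the file with read_csv(chunksize=...), hashes each row with pandas.util.hash_pandas_object, and, like zwer's approach, simply discards any row whose hash has already been seen, so an extremely unlikely hash collision could drop a non-duplicate row. The chunk size of one million rows is an arbitrary choice to tune against your RAM, and because pandas re-writes the values, the output formatting (quoting, float representation) may differ slightly from the input file.

    import pandas as pd

    file_in = "C:\\OUT\\Concat EPC3.csv"
    file_out = "C:\\OUT\\Concat EPC3- NoDupl.csv"

    seen = set()        # hashes of every row written so far
    first_chunk = True  # write the header and open in "w" mode only once

    # Only one chunk is held in memory at a time
    for chunk in pd.read_csv(file_in, chunksize=1_000_000, dtype=str):
        # One 64-bit hash per row; dtype=str keeps hashes stable even if
        # pandas would infer different column types in different chunks
        row_hashes = pd.util.hash_pandas_object(chunk, index=False)
        # Keep rows not repeated within this chunk and not seen in earlier chunks
        keep = ~row_hashes.duplicated() & ~row_hashes.isin(seen)
        seen.update(row_hashes[keep])
        chunk[keep].to_csv(file_out,
                           mode="w" if first_chunk else "a",
                           header=first_chunk,
                           index=False)
        first_chunk = False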