python · pandas · csv · grouping · distinct-values

Removing duplicate records from CSV file using Python Pandas


I would like to remove duplicate records from a CSV file using Python Pandas. The file has three columns: scale, minzoom, and maxzoom. I want a resulting dataframe containing only minzoom and maxzoom, with the remaining records being unique.

I.e.,

Input CSV file (lookup_scales.csv)

Scale, minzoom, maxzoom
2000, 0, 15
3000, 0, 15
10000, 8, 15
20000, 8, 15
200000, 15, 18
250000, 15, 18

Required distinct_lookup_scales.csv (without the scale column)

minzoom, maxzoom
0,15
8,15
15,18

My code so far is

lookup_scales_df = pd.read_csv('C:/Marine/lookup/lookup_scales.csv', names = ['minzoom','maxzoom'])
lookup_scales_df = lookup_scales_df.set_index([2, 3])
file_name = "C:/Marine/lookup/distinct_lookup_scales.csv"
lookup_scales_df.groupby('minzoom', 'maxzoom').to_csv(file_name, sep=',')

Solution

  • You don't need NumPy or any extra tooling. You can de-duplicate in one line while reading the CSV file with Pandas:

    import pandas as pd
    df = pd.read_csv('lookup_scales.csv', usecols=['minzoom', 'maxzoom']).drop_duplicates(keep='first').reset_index(drop=True)
    

    Output:

       minzoom  maxzoom
    0        0       15
    1        8       15
    2       15       18
    

    Then to write it out to a CSV file:

    df.to_csv(file_name, index=False)  # No need to set sep; to_csv writes comma-delimited output by default.
    
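    For completeness, here is a self-contained sketch of the whole pipeline (the CSV content is inlined via `io.StringIO` so it runs without the original file; `keep='first'` is the default and retains the first row of each duplicate group, while `keep=False` would discard every duplicated row entirely):

    ```python
    import io
    import pandas as pd

    # Inline stand-in for lookup_scales.csv
    csv_data = """Scale,minzoom,maxzoom
    2000,0,15
    3000,0,15
    10000,8,15
    20000,8,15
    200000,15,18
    250000,15,18
    """

    # Read only the two columns we care about, then drop duplicate rows.
    df = pd.read_csv(io.StringIO(csv_data), usecols=['minzoom', 'maxzoom'])
    unique_df = df.drop_duplicates(keep='first').reset_index(drop=True)
    print(unique_df)
    ```

    Note that `drop_duplicates` considers all selected columns together, so a row is only dropped when both its minzoom and maxzoom match an earlier row.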

    So the whole code:

    import pandas as pd
    df = pd.read_csv('lookup_scales.csv', usecols=['minzoom', 'maxzoom']).drop_duplicates(keep='first').reset_index(drop=True)
    file_name = "C:/Marine/lookup/distinct_lookup_scales.csv"
    df.to_csv(file_name, index=False)  # No need to set sep; to_csv writes comma-delimited output by default.
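
    As a quick sanity check, you can round-trip the result through `to_csv`/`read_csv` and confirm no duplicates survive (this sketch uses inlined data and a temp-directory path as stand-ins for the original file locations):

    ```python
    import io
    import os
    import tempfile
    import pandas as pd

    # Inline stand-in for the two relevant columns of lookup_scales.csv
    csv_data = "minzoom,maxzoom\n0,15\n0,15\n8,15\n15,18\n"
    df = pd.read_csv(io.StringIO(csv_data)).drop_duplicates().reset_index(drop=True)

    # Write out, read back, and verify every remaining row is unique.
    out_path = os.path.join(tempfile.gettempdir(), 'distinct_lookup_scales.csv')
    df.to_csv(out_path, index=False)
    check = pd.read_csv(out_path)
    assert not check.duplicated().any()
    print(check)
    ```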