Tags: python, csv, spreadsheet, opencsv

Removing duplicates between multiple CSV files


I have multiple CSV files with two columns in each of these CSV files:

  1. Links (Column A)
  2. Description (Column B)

I don't know the best way to remove all duplicates of a link and description, leaving only one instance of each. Ideally I could import all of the CSV files at once, since a link may appear in more than one file. Whenever there is a duplicate, the link and description are EXACTLY the same. Thanks!


Solution

  • This can be done by concatenating the frames with pd.concat and then calling drop_duplicates.

    import pandas as pd
    
    df1 = pd.read_csv('path/to/file1.csv')
    df2 = pd.read_csv('path/to/file2.csv')
    
    df = pd.concat([df1, df2]).drop_duplicates().reset_index(drop=True)
    

    Please refer to the linked Stack Overflow answer to understand more.
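
  • Since the question mentions importing many files at once, here is a minimal sketch that generalizes the snippet above with glob. It assumes every CSV matching the pattern shares the same two columns (Links, Description) and that duplicates are exact row matches; the function name and paths are illustrative.

    ```python
    import glob

    import pandas as pd


    def dedupe_csvs(pattern, out_path):
        """Read every CSV matching `pattern`, drop exact duplicate rows
        across all files, and write the combined result to `out_path`."""
        frames = [pd.read_csv(p) for p in sorted(glob.glob(pattern))]
        # ignore_index=True renumbers rows so the combined index is clean
        combined = pd.concat(frames, ignore_index=True)
        # drop_duplicates keeps the first occurrence of each identical row
        deduped = combined.drop_duplicates()
        deduped.to_csv(out_path, index=False)
        return deduped


    # Example usage (hypothetical paths):
    # result = dedupe_csvs('path/to/*.csv', 'path/to/combined.csv')
    ```

    If duplicates should be judged on the link alone rather than the full row, pass `subset=['Links']` to drop_duplicates.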