
merge csv files with different column order remove duplicates


I have multiple CSV files with the same number of columns but a different column order in each. I want to merge them while removing duplicates. The other solutions here don't account for column order, so their merged output is incorrect. How can I do this on the Windows command line (e.g. with LogParser) or in bash?

A Python script that achieves this would also work.


Solution

  • The following script works properly if:

    • the CSVs aren't too big (i.e. they fit in memory)
    • the first row of each CSV contains the column names

    You only have to fill in files and final_headers:

    import csv
    
    files = ['c1.csv', 'c2.csv', 'c3.csv']
    final_headers = ['col1', 'col2', 'col3']
    
    merged_rows = set()
    for f in files:
        with open(f, newline='') as csv_in:
            csvreader = csv.reader(csv_in, delimiter=',')
            # map each column name to its index in this particular file
            headers = {h: i for i, h in enumerate(next(csvreader))}
            for row in csvreader:
                # reorder the row to match final_headers before deduplicating
                merged_rows.add(tuple(row[headers[x]] for x in final_headers))
    
    with open('output.csv', 'w', newline='') as csv_out:
        csvwriter = csv.writer(csv_out, delimiter=',')
        csvwriter.writerow(final_headers)
        csvwriter.writerows(merged_rows)
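
  • If pandas is available, the same idea can be sketched more compactly. This is a hedged alternative, not part of the original answer: selecting final_headers from each file reorders its columns by name, concatenation stacks the reordered frames, and drop_duplicates removes repeated rows. The function name merge_csvs is illustrative.

    ```python
    # Sketch of the same merge using pandas (assumes pandas is installed).
    import pandas as pd

    def merge_csvs(files, final_headers, out_path):
        # Read each file as strings, reorder its columns by name,
        # then concatenate all frames and drop duplicate rows.
        frames = (pd.read_csv(f, dtype=str)[final_headers] for f in files)
        merged = pd.concat(frames, ignore_index=True).drop_duplicates()
        merged.to_csv(out_path, index=False)
    ```

    Reading with dtype=str keeps every value as text, so duplicate detection behaves like the csv-module version above rather than comparing parsed numbers.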