Search code examples
pythoncsvmultiple-columnslarge-datalarge-files

Merge multiple csv files with different columns in python error with writerow


I have a large number of csv files/dataframes that are too large to store together in memory. However, I noticed the size of the columns are different between these dataframes. My columns are permutations of "ACGT" (DNA Sequences). I followed the instructions from this question on how to write multiple csvs with different columns, however I get the following error: AttributeError: 'str' object has no attribute 'keys'. I found this question to address the error, however I am unsure where to edit the code to make the 'line' object a dictionary. I am also worried my csv files which have an index column without a header value may be messing up my code or the format of my fieldnames (str derived from permutations) may be an issue. If there is a way to concat multiple csv files with different in another language I am amendable to that however I have run into issues with this question as well.

import glob
import csv
import os

mydir = "test_csv/"

file_list = glob.glob(mydir + "/*.csv") # Include slash or it will search in the wrong directory!!
file_list

import itertools
fieldnames = []
for p in itertools.product('ACGT', repeat=8):
    fieldnames.append("".join(p))


for filename in file_list:
    with open(filename, "r", newline="") as f_in:
        reader = csv.reader(f_in)
        headers = next(reader)
with open("Outcombined.csv", "w", newline="") as f_out:
    writer = csv.DictWriter(f_out, fieldnames=fieldnames)
    for filename in file_list:
        with open(filename, "r", newline="") as f_in:
            reader = csv.DictReader(f_in)
            for line in headers:
                writer.writerow(line)

Solution

  • You only need to write the header once, so do that before your file_list loop:

    with open('Outcombined.csv','w',newline='') as f_out: 
        writer = csv.DictWriter(f_out,fieldnames=fieldnames) 
        writer.writeheader() # write header based on `fieldnames`
        for filename in file_list: 
            with open(filename,'r',newline='') as f_in: 
            reader = csv.DictReader(f_in) 
            for line in reader: 
                writer.writerow(line) 
    

    The DictWriter will take care of placing the values under the correct headers.