I am facing the following challenges
I have approximately 400 files which i have to consolidate into one master file but there is one problem that the files have different headers and when I try to consolidate it put the data into different rows on the basis of column
Example:- lets say i have two files C1 and C2 file C1.csv
name,phone-no,address
zach,6564654654,line1
daniel,456464564,line2
and file C2.csv
name,last-name,phone-no,add-line1,add-line2,add-line3
jorge,aggarwal,65465464654,line1,line2,line3
brad,smit,456446546454,line1,line2,line3
joy,kennedy,65654644646,line1,line2,line3
so I have these two files and from these files I want that when I consolidate these files the output will be like this:-
name,phone-no,address
zach,6564654654,line1
daniel,456464564,line2
Jorge aggarwal,65465464654,line1-line2-line3
brad smith,456446546454,line1-line2-line3
joy kennedy,65654644646,line1-line2-line3
for Consolidation I am using the following code
import glob
import pandas as pd
directory = 'C:/Test' # specify the directory containing the 300 files
filelist = sorted (glob.glob(directory + '/*.csv')) # reads all 300 files in the directory and stores as a list
consolidated = pd.DataFrame() # Create a new empty dataframe for consolidation
for file in filelist: # Iterate through each of the 300 files
df1 = pd.read_csv(file) # create df using the file
df1col = list (df1.columns) # save columns to a list
df2 = consolidated # set the consolidated as your df2
df2col = list (df2.columns) # save columns from consolidated result as list
commoncol = [i for i in df1col for j in df2col if i==j] # Check both lists for common column name
# print (commoncol)
if commoncol == []: # In first iteration, consolidated file is empty, which will return in a blank df
consolidated = pd.concat([df1, df2], axis=1).fillna(value=0) # concatenate (outer join) with no common columns replacing null values with 0
else:
consolidated = df1.merge(df2,how='outer', on=commoncol).fillna(value=0) # merge both df specifying the common column and replace null values with 0
# print (consolidated) << Optionally, check the consolidated df at each iteration
# writing consolidated df to another CSV
consolidated.to_csv('C:/<filepath>/consolidated.csv', header=True, index=False)
but it can't merge the columns having same data like the output shown earlier.
From your two-file example, you know the final (least common) header for the output, and you know what one of the bigger headers looks like.
My take on that is to think of every "other" kind of header as needing a mapping to the final header, like concatenating add-lines 1-3 into a single address field. We can use the csv module to read and write row-by-row and send the rows to the appropriate consolidator (mapping) based on the header of the input file.
The csv module provides a DictReader and DictWriter which makes dealing with fields you know by name very handy; especially, the DictWriter() constructor has the extrasaction="ignore" option which means that if you tell the writer your fields are:
Col1, Col2, Col3
and you pass a dict like:
{"Col1": "val1", "Col2": "val2", "Col3": "val3", "Col4": "val4"}
it will just ignore Col4 and only write Cols 1-3:
writer = csv.DictWriter(sys.stdout, fieldnames=["Col1", "Col2", "Col3"], extrasaction="ignore")
writer.writeheader()
writer.writerow({"Col1": "val1", "Col2": "val2", "Col3": "val3", "Col4": "val4"})
# Col1,Col2,Col3
# val1,val2,val3
import csv
def consolidate_add_lines_1_to_3(row):
row["address"] = "-".join([row["add-line1"], row["add-line2"], row["add-line3"]])
return row
# Add other consolidators here...
# ...
Final_header = ["name", "phone-no", "address"]
f_out = open("output.csv", "w", newline="")
writer = csv.DictWriter(f_out, fieldnames=Final_header, extrasaction="ignore")
writer.writeheader()
for fname in ["file1.csv", "file2.csv"]:
f_in = open(fname, newline="")
reader = csv.DictReader(f_in)
for row in reader:
if "add-line1" in row and "add-line2" in row and "add-line3" in row:
row = consolidate_add_lines_1_to_3(row)
# Add conditions for other consolidators here...
# ...
writer.writerow(row)
f_in.close()
f_out.close()
If there are more than one kind of other header, you'll need to seek those out, and figure out the extra consolidators to write, and the conditions to trigger them in for row in reader
loop.