So i'm new to python, I want to discover it's potential and get more knowledge about what I can do with it. I did this code to compare CSVs, basically what it does you provide it 2 CSVs, CSV1 that has some id column and a column with values that you want to add to other CSV (CSV2)
Note: This script does exactly what I want, and it seems to work fine, hope it can also be useful to someone, my question is really what I can do to improve it's performance or even make the code cleaner
# Made by Varqas
# CSV1 = CSV containing values that can be matched in CSV2 and a column that will be added
# CSV2 = CSV containing values that can be matched and column that will be concatenated at the end of the CSV (The last column values should be empty)
with open('csv1.csv', encoding="utf8") as check_file:
# Get Column that will be used to Compare values and add it to a list
columnToCompare = list([row.split(',')[0].strip() for row in check_file])
with open('csv1.csv', encoding="utf8") as check_file:
# Get Column that will be used to add to a row values and add it to a list
columnToAdd = list([row.split(',')[2].strip() for row in check_file])
with open('csv2.csv', 'r', encoding="utf8") as in_file, open('out.csv', 'w', encoding="utf8") as out_file:
i = 0
# For each Row in CSV2
for line in in_file:
# Write Headers
if i == 0:
out_file.write(line)
else:
# GET Column on CSV2 containing value that will be compared on CVS1
value = line.split(',')[1].strip()
# Check if first Column value on CSV2 either variable is in
if value in columnToCompare:
# Check for duplicates in the list
numberOfOccurences = list(columnToCompare).count(value)
concatRow = ""
if numberOfOccurences > 1:
# Concatenate all values of occurences
for x in range(numberOfOccurences):
index = list(columnToCompare).index(value)
concatRow = concatRow + columnToAdd[index]
if x != numberOfOccurences - 1:
concatRow = concatRow + " + "
# Remove value so list.index doesn't found same row
columnToCompare[index] = ""
else:
# Add other row that doesn't match
index = list(columnToCompare).index(value)
concatRow = columnToAdd[index]
# Concat to last column of CSV2
out_file.write(line.strip() + concatRow + "\n")
else:
# Still concat value in CSV2 to last column if not found in csv1
out_file.write(line.strip() + "not found" + "\n")
i = i + 1
I know it can be improved, and minified perhaps using some libs... Let me know what you think!
I tried using pd merge, but I didn't understand quite well how I could add concatenation and values inside it.
You can use the Pandas library to read the two CSV files into dataframes and merge the two columns into the second CSV and output a new CSV with the merged columns.
import pandas as pd
# read first CSV
df1 = pd.read_csv('first.csv')
# read second CSV
df2 = pd.read_csv('second.csv')
# merge the id column and a "column with values
# that you want to add to other CSV (CSV2)"
# for the example the second column is named 'data'.
merged_df = pd.merge(df2, df1[['id', 'data']], on='id', how='left')
# save new dataframe to csv.
merged_df.to_csv('merged.csv', index=False)