I'm trying to merge CSV files and putting them into an SQLite table. It works but after I do that once I can't update my table without duplicating rows. I can add new merge columns, but not after the last row.
Reading CSV files and getting the specific columns :
import pandas as pd
import sqlite3
pd.set_option('display.max_columns', 6)
dia1 = pd.read_csv('dia0705.csv', header=1, sep=";", dtype='unicode')[["EO", "NOME ACIONISTA", "CPF/CNPJ"]]
dia2 = pd.read_csv('dia0712.csv', header=1, sep=";", dtype='unicode')[["EO", "NOME ACIONISTA", "CPF/CNPJ"]]
dia1['CPF/CNPJ'] = dia1['CPF/CNPJ'].astype(str)
dia2['CPF/CNPJ'] = dia2['CPF/CNPJ'].astype(str)
dia1['NOME ACIONISTA'] = dia1['NOME ACIONISTA'].astype(str)
dia2['NOME ACIONISTA'] = dia2['NOME ACIONISTA'].astype(str)
Then I use the merge command and rename columns to match my SQLite table :
merge1 = pd.merge(dia1, dia2, how='outer', on=["NOME ACIONISTA", "CPF/CNPJ"],) # indicator=True)
merge1.rename(columns={"EO_x": "dia0705"}, inplace=True)
merge1.rename(columns={"EO_y": "dia0712"}, inplace=True)
merge1.rename(columns={"NOME ACIONISTA": "Nome_Acionista"}, inplace=True)
merge1.rename(columns={"CPF/CNPJ": "CPF_CNPJ"}, inplace=True
Repeating because I want to merge multiple CSV files :
merge2 = pd.merge(merge1, dia3, how='outer', on=["Nome_Acionista", "CPF_CNPJ"],)
dia4 = pd.read_csv('220913_completo.csv', header=1, sep=";", dtype='unicode')[["EO", "NOME ACIONISTA", "CPF/CNPJ"]]
dia4.rename(columns={"NOME ACIONISTA": "Nome_Acionista"}, inplace=True)
dia4.rename(columns={"CPF/CNPJ": "CPF_CNPJ"}, inplace=True)
dia4.rename(columns={"EO": "dia0913"}, inplace=True)
merge3 = pd.merge(merge2, dia4, how='outer', on=["Nome_Acionista", "CPF_CNPJ"],)
dia4 = pd.read_csv('220913_completo.csv', header=1, sep=";", dtype='unicode')[["EO", "NOME ACIONISTA", "CPF/CNPJ"]]
dia4.rename(columns={"NOME ACIONISTA": "Nome_Acionista"}, inplace=True)
dia4.rename(columns={"CPF/CNPJ": "CPF_CNPJ"}, inplace=True)
dia4.rename(columns={"EO": "dia0913"}, inplace=True)
If I connect this to my SQLite table it works :
connection = sqlite3.connect('2022.db')
c = connection.cursor()
merge3.to_sql(
name='acoes',
con=connection,
if_exists='append',
index=False,
)
My table contains columns name_ID
, Number_ID
and 365 other for each day of the year.
As everything is almost identical for each file, you can loop through the files. Below I have a few specifics:
.csv
) to loop through.Some suggestions for how these could be changed:
os
and/or glob
could be used to loop through files within a directory (filtering can be included in this, for example for only those ending in .csv
.Also note that the first file has been added before the loop. If looping through the files in a folder, an if
statement could be included, along the lines of "if this is the first file (maybe merged_data = pd.DataFrame(); if pd.DataFrame().empty: ...) then create the DataFrame."
import pandas as pd
files = ["file1.csv", "file2.csv"] # list of file names
# otherwise, you could use os and/or glob to loop through the files in a folder
# the first file, don't include this in the list above. If part of a loop, could add in an enumeration and if == first item...
merged_data = pd.read_csv('dia0705.csv', header=1, sep=";", dtype='unicode')[["EO", "NOME ACIONISTA", "CPF/CNPJ"]]
merged_data['CPF/CNPJ'] = merged_data['CPF/CNPJ'].astype(str)
merged_data['NOME ACIONISTA'] = merged_data['NOME ACIONISTA'].astype(str)
merged_data.rename(columns={"NOME ACIONISTA": "Nome_Acionista",
"CPF/CNPJ": "CPF_CNPJ",
"EO": "dia0705"}, inplace=True)
# loop through all files in list
for file in files:
# read_csv and select specific columns
to_merge = pd.read_csv(file, header=1, sep=";", dtype='unicode')[["EO", "NOME ACIONISTA", "CPF/CNPJ"]]
# change column names
to_merge.rename(columns={"NOME ACIONISTA": "Nome_Acionista",
"CPF/CNPJ": "CPF_CNPJ",
"EO": file.split(".")[0]}, inplace=True)
# merge the merged_data and to_merge DataFrames
merged_data = pd.merge(merged_data, to_merge,
how='outer',
on=["Nome_Acionista", "CPF_CNPJ"],)