Search code examples
pythonpandassqlitemergedatatable

Merging with Pandas and inserting into SQLite


I'm trying to merge CSV files and putting them into an SQLite table. It works but after I do that once I can't update my table without duplicating rows. I can add new merge columns, but not after the last row.

Reading CSV files and getting the specific columns :

import pandas as pd
import sqlite3

pd.set_option('display.max_columns', 6)
dia1 = pd.read_csv('dia0705.csv', header=1, sep=";", dtype='unicode')[["EO", "NOME ACIONISTA", "CPF/CNPJ"]]
dia2 = pd.read_csv('dia0712.csv', header=1, sep=";", dtype='unicode')[["EO", "NOME ACIONISTA", "CPF/CNPJ"]]

dia1['CPF/CNPJ'] = dia1['CPF/CNPJ'].astype(str)
dia2['CPF/CNPJ'] = dia2['CPF/CNPJ'].astype(str)

dia1['NOME ACIONISTA'] = dia1['NOME ACIONISTA'].astype(str)
dia2['NOME ACIONISTA'] = dia2['NOME ACIONISTA'].astype(str)

Then I use the merge command and rename columns to match my SQLite table :

merge1 = pd.merge(dia1, dia2, how='outer', on=["NOME ACIONISTA", "CPF/CNPJ"],)  # indicator=True)

merge1.rename(columns={"EO_x": "dia0705"}, inplace=True)
merge1.rename(columns={"EO_y": "dia0712"}, inplace=True)
merge1.rename(columns={"NOME ACIONISTA": "Nome_Acionista"}, inplace=True)
merge1.rename(columns={"CPF/CNPJ": "CPF_CNPJ"}, inplace=True

Repeating because I want to merge multiple CSV files :

merge2 = pd.merge(merge1, dia3, how='outer', on=["Nome_Acionista", "CPF_CNPJ"],)

dia4 = pd.read_csv('220913_completo.csv', header=1, sep=";", dtype='unicode')[["EO", "NOME ACIONISTA", "CPF/CNPJ"]]
dia4.rename(columns={"NOME ACIONISTA": "Nome_Acionista"}, inplace=True)
dia4.rename(columns={"CPF/CNPJ": "CPF_CNPJ"}, inplace=True)
dia4.rename(columns={"EO": "dia0913"}, inplace=True)

merge3 = pd.merge(merge2, dia4, how='outer', on=["Nome_Acionista", "CPF_CNPJ"],)

dia4 = pd.read_csv('220913_completo.csv', header=1, sep=";", dtype='unicode')[["EO", "NOME ACIONISTA", "CPF/CNPJ"]]
dia4.rename(columns={"NOME ACIONISTA": "Nome_Acionista"}, inplace=True)
dia4.rename(columns={"CPF/CNPJ": "CPF_CNPJ"}, inplace=True)
dia4.rename(columns={"EO": "dia0913"}, inplace=True)

If I connect this to my SQLite table it works :

connection = sqlite3.connect('2022.db')
c = connection.cursor()

merge3.to_sql(
        name='acoes',
        con=connection,
        if_exists='append',
        index=False,
    )

My table contains columns name_ID, Number_ID and 365 other for each day of the year.


Solution

  • As everything is almost identical for each file, you can loop through the files. Below I have a few specifics:

    1. There is a manually written list of file names (with .csv) to loop through.
    2. The column renaming from "EO" is to the name of the file (without file type).

    Some suggestions for how these could be changed:

    1. Instead of a list that requires manually writing, os and/or glob could be used to loop through files within a directory (filtering can be included in this, for example for only those ending in .csv.
    2. If there are different names then a dictionary could be created to map the file name and file together. Alternatively, another list which would be in the same order and the file list. This would be more difficult if looping through the files in the directory.

    Also note that the first file has been added before the loop. If looping through the files in a folder, an if statement could be included, along the lines of "if this is the first file (maybe merged_data = pd.DataFrame(); if pd.DataFrame().empty: ...) then create the DataFrame."

    import pandas as pd
    
    files = ["file1.csv", "file2.csv"] # list of file names
    # otherwise, you could use os and/or glob to loop through the files in a folder
    
    # the first file, don't include this in the list above.  If part of a loop, could add in an enumeration and if == first item...
    merged_data = pd.read_csv('dia0705.csv', header=1, sep=";", dtype='unicode')[["EO", "NOME ACIONISTA", "CPF/CNPJ"]]
    merged_data['CPF/CNPJ'] = merged_data['CPF/CNPJ'].astype(str)
    merged_data['NOME ACIONISTA'] = merged_data['NOME ACIONISTA'].astype(str)
    
    merged_data.rename(columns={"NOME ACIONISTA": "Nome_Acionista",
                         "CPF/CNPJ": "CPF_CNPJ",
                         "EO": "dia0705"}, inplace=True)
    
    # loop through all files in list
    for file in files:
        # read_csv and select specific columns
        to_merge = pd.read_csv(file, header=1, sep=";", dtype='unicode')[["EO", "NOME ACIONISTA", "CPF/CNPJ"]]
        # change column names
        to_merge.rename(columns={"NOME ACIONISTA": "Nome_Acionista",
                                 "CPF/CNPJ": "CPF_CNPJ",
                                 "EO": file.split(".")[0]}, inplace=True)
        # merge the merged_data and to_merge DataFrames
        merged_data = pd.merge(merged_data, to_merge,
                               how='outer',
                               on=["Nome_Acionista", "CPF_CNPJ"],)