Search code examples
pythonpandasdataframesqlalchemypandas-to-sql

(dataframe.to_sql with reference_or_insert): How to automatically insert a missing record in a referenced table when a foreign key is not found?


Description

I am trying to migrate data from a Pandas DataFrame to a MySQL database table but that data has some inconsistencies that I want to work around though I have not yet figured out a way to. Any help in figuring this out will be very much appreciated.


Example of the data I have:

user_type (table)

code detail
a Secretary
b Accountant

user_df (DataFrame with the data I want to migrate to the user table)

id name user_type_code (FK: user_type)
1 Jane Doe a
2 John Doe a
3 James Doe b
4 Jeff Doe c
5 Jennifer Doe d

As you can notice from the above data, the user_type_code with values c & d cannot be found in the user_type table.

What I want to achieve is to automatically insert those user_type missing data with dummy information to accommodate for the need of being corrected in the future and keep all the user records.

user_type table (how I want it to be at the end)

code detail
a Secretary
b Accountant
c Unknown c
d Unknown d



My Current Implementation

import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy.dialects.mysql import insert
from sqlalchemy.exc import NoReferenceError

# I want to add an implementation of inserting the dummy data in the referenced table (user_type) in this function
def insert_ignore_on_duplicates(table, conn, keys, data_iter):
    """ Insert ignore on duplicate primary keys """
    try:
        insert_stmt = insert(table.table).values(list(data_iter))
        on_duplicate_key_stmt = insert_stmt.on_duplicate_key_update(
            insert_stmt.inserted
        )
        conn.execute(on_duplicate_key_stmt)
    except NoReferenceError as error:
        print("Error: {}".format(error))

db_engine = create_engine("mysql+mysqlconnector://username:password@localhost:3306/")

user_df = pd.DataFrame()  # Assume this contains all the users' data

user_df.to_sql(
    "user",
    con=db_engine,
    if_exists="append",
    index=False,
    method=insert_ignore_on_duplicates,
    chunksize=5000,
)

I am seeking help to figure out how this insert_ignore_on_duplicates function/method can be modified to allow the automatic insertion of missing foreign key references or any other approach that can perform that.


Some Related Questions I Found


P.S. The reason why I need this implementation is because the data is large (>4 million records) and it contains numerous foreign keys that are not present hence cannot practically be checked manually. Adding these primary dummy data will help in keeping all the data and allowing suitable correction in the future, perhaps updating the record c: Unknown c to c: Auditor


Solution

  • What you really need is a list of the missing codes in the user_type table. You can get that like this:

    import pandas as pd
    
    # example data
    user_type = pd.DataFrame(
        [("a", "Secretary"), ("b", "Accountant")], columns=["code", "detail"]
    )
    # (the above would actually be retrieved via `pd.read_sql_table("user_type", engine)`)
    user_df = pd.DataFrame(
        [
            (1, "Jane Doe", "a"),
            (2, "John Doe", "a"),
            (3, "James Doe", "b"),
            (4, "Jeff Doe", "c"),
            (5, "Jennifer Doe", "d"),
        ],
        columns=["id", "name", "user_type_code"],
    )
    
    # real code starts here
    user_type_code_list = user_type["code"].unique()
    user_df_code_list = user_df["user_type_code"].unique()
    user_types_to_add = pd.DataFrame(
        [
            (f"{x}", f"Unknown {x}")
            for x in user_df_code_list
            if x not in user_type_code_list
        ],
        columns=["code", "detail"],
    )
    print(user_types_to_add)
    """
      code     detail
    0    c  Unknown c
    1    d  Unknown d
    """
    

    You can then use

    user_types_to_add.to_sql("user_type", db_engine, index=False, if_exists="append")
    

    to add the missing rows to the user_type table, followed by

    user_df.to_sql("user", db_engine, index=False, if_exists="append", …)