How to insert a huge pandas DataFrame into a MySQL table with parallel INSERT statements?

I am working on a project where I have to write a DataFrame with millions of rows and about 25 columns, mostly of numeric type, to a database. I am using the pandas DataFrame.to_sql function to dump the DataFrame into a MySQL table. I have found that this function can create an INSERT statement that inserts multiple rows at once. This is a good approach, but MySQL limits the length of a query that can be built this way.

Is there a way to run the inserts in parallel against the same table so that I can speed up the process?


  • You can do a few things to achieve that.

    One way is to pass an additional argument to to_sql:

    df.to_sql("foo_bar", engine, if_exists='append', method='multi')

    According to the pandas documentation, passing 'multi' as the method argument packs multiple rows into a single INSERT statement, which gives you a bulk insert.
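
    Because a multi-row INSERT can outgrow the server's query-size limit (in MySQL this is governed by max_allowed_packet), it usually helps to cap the number of rows per statement; in pandas that is what the chunksize argument of to_sql is for. The sketch below only illustrates the idea: it uses the standard-library sqlite3 module instead of MySQL, and insert_in_batches is a hypothetical helper, not what pandas runs internally.

```python
import sqlite3


def insert_in_batches(conn, table, columns, rows, batch_size=500):
    """Insert rows with one multi-row INSERT statement per batch.

    Hypothetical helper for illustration; table and column names are
    assumed to be trusted identifiers, not user input.
    """
    row_placeholder = "(" + ", ".join("?" for _ in columns) + ")"
    col_list = ", ".join(columns)
    statements = 0
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        sql = (
            f"INSERT INTO {table} ({col_list}) VALUES "
            + ", ".join(row_placeholder for _ in batch)
        )
        # Flatten the batch so the parameters line up with the placeholders.
        params = [value for row in batch for value in row]
        conn.execute(sql, params)
        statements += 1
    conn.commit()
    return statements


# In-memory database standing in for the real MySQL table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo_bar (a INTEGER, b REAL)")
rows = [(i, i * 0.5) for i in range(1050)]
n_statements = insert_in_batches(conn, "foo_bar", ["a", "b"], rows, batch_size=100)
```

    Each statement carries at most batch_size rows, so the statement length stays bounded no matter how large the DataFrame is. With pandas the equivalent knob would be something like chunksize=1000 passed next to method='multi'; the right value depends on row width and server settings.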

    Another solution is to build a custom insert function with multiprocessing.dummy, which exposes the multiprocessing Pool API but is backed by threads:

    import math
    from multiprocessing.dummy import Pool as ThreadPool

    def insert_df(df, *args, **kwargs):
        nworkers = 4  # number of threads that execute inserts in parallel
        chunk = math.floor(df.shape[0] / nworkers)  # rows per chunk
        # (start, stop) row ranges: nworkers equal slices plus the remainder
        chunks = [(chunk * i, (chunk * i) + chunk) for i in range(nworkers)]
        chunks.append((chunk * nworkers, df.shape[0]))
        pool = ThreadPool(nworkers)

        def worker(chunk):
            i, j = chunk
            df.iloc[i:j, :].to_sql(*args, **kwargs)

        pool.map(worker, chunks)  # blocks until every slice is written
        pool.close()
        pool.join()

    insert_df(df, "foo_bar", engine, if_exists='append')
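
    The slicing in insert_df yields nworkers equal slices plus one remainder slice, so there is one more task than there are threads, and the remainder slice is empty whenever the row count divides evenly. An alternative, sketched below under the assumption that balanced slices are preferred, spreads the remainder over the first few workers. chunk_bounds is a hypothetical helper, and the worker sums a plain list as a stand-in for the df.iloc[i:j].to_sql(...) call.

```python
from multiprocessing.dummy import Pool as ThreadPool


def chunk_bounds(n_rows, n_workers):
    """Split range(n_rows) into at most n_workers contiguous (start, stop) pairs."""
    size, extra = divmod(n_rows, n_workers)
    bounds, start = [], 0
    for w in range(n_workers):
        # The first `extra` workers take one additional row each.
        stop = start + size + (1 if w < extra else 0)
        if stop > start:  # skip empty slices when n_rows < n_workers
            bounds.append((start, stop))
        start = stop
    return bounds


rows = list(range(10))  # stand-in for the DataFrame rows
bounds = chunk_bounds(len(rows), 4)


def worker(bound):
    i, j = bound
    # Stand-in for df.iloc[i:j, :].to_sql(...); here we just sum the slice.
    return sum(rows[i:j])


with ThreadPool(4) as pool:
    partial_sums = pool.map(worker, bounds)  # blocks until all slices finish
```

    Every row lands in exactly one slice and no thread receives an empty one. Whether threads actually speed up the load depends on the driver and the server, so it is worth timing both variants against the real table.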

    The second method was suggested at