mysql, pandas, pandasql

How to insert a huge Pandas DataFrame into a MySQL table with parallel INSERT statements?


I am working on a project where I have to write a DataFrame with millions of rows and about 25 columns, mostly numeric, to a MySQL table. I am using the pandas DataFrame.to_sql function to dump the DataFrame into the table. I have found that this function can build an INSERT statement that inserts multiple rows at once. This is a good approach, but MySQL limits the length of a query that can be built this way.

Is there a way to run the inserts in parallel against the same table so that I can speed up the process?


Solution

  • You can do a few things to achieve that.

    One way is to pass an additional argument to to_sql when writing the DataFrame.

    df.to_sql("foo_bar", engine, if_exists='append', method='multi')
    

    According to the pandas documentation, passing 'multi' as the method argument sends multiple rows in a single INSERT statement instead of one statement per row.
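To make the idea of a multi-row INSERT concrete, here is a minimal sketch that batches rows by hand. It uses the standard-library sqlite3 module as a stand-in for MySQL so that it runs anywhere; the helper name insert_in_batches, the two-column table layout, and the batch size are assumptions for illustration, not part of pandas.

```python
import sqlite3

def insert_in_batches(conn, table, columns, rows, batch_size=500):
    """Insert rows using one multi-row INSERT statement per batch.

    Hypothetical helper for illustration; returns the number of statements sent.
    """
    col_list = ", ".join(columns)
    row_placeholder = "(" + ", ".join("?" for _ in columns) + ")"
    statements = 0
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        # One statement carrying len(batch) rows:
        # INSERT INTO t (a, b) VALUES (?, ?), (?, ?), ...
        sql = (f"INSERT INTO {table} ({col_list}) VALUES "
               + ", ".join([row_placeholder] * len(batch)))
        flat_params = [value for row in batch for value in row]
        conn.execute(sql, flat_params)
        statements += 1
    conn.commit()
    return statements

# In-memory SQLite database standing in for the MySQL table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo_bar (a INTEGER, b REAL)")
rows = [(i, i * 0.5) for i in range(1050)]
n_statements = insert_in_batches(conn, "foo_bar", ["a", "b"], rows, batch_size=100)
```

Capping the batch size is what keeps each statement under the server's query length limit (in MySQL this is governed by max_allowed_packet). With pandas, the chunksize argument of to_sql plays the same role, for example df.to_sql("foo_bar", engine, if_exists='append', method='multi', chunksize=1000), where 1000 is an arbitrary illustrative value that should be tuned to the row width.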

    Another solution is to write a custom insert function using multiprocessing.dummy, which exposes the multiprocessing API on top of threads. Documentation: https://docs.python.org/2/library/multiprocessing.html#module-multiprocessing.dummy

    import math
    from multiprocessing.dummy import Pool as ThreadPool
    
    ...
    
    def insert_df(df, *args, **kwargs):
        nworkers = 4  # number of threads that run inserts in parallel

        n_rows = df.shape[0]
        # rows per chunk, rounded up so at most nworkers non-empty chunks cover the frame
        chunk = max(1, math.ceil(n_rows / nworkers))
        chunks = [(i, min(i + chunk, n_rows)) for i in range(0, n_rows, chunk)]
        pool = ThreadPool(nworkers)

        def worker(bounds):
            # write one contiguous slice of rows with its own to_sql call
            i, j = bounds
            df.iloc[i:j, :].to_sql(*args, **kwargs)

        # the target table should already exist; otherwise workers may race to create it
        pool.map(worker, chunks)
        pool.close()
        pool.join()
    
    ....
    
    insert_df(df, "foo_bar", engine, if_exists='append')
    

    The second method was suggested at https://stackoverflow.com/a/42164138/5614132.
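
The row ranges handed to each worker can be computed and checked in isolation, without a database. The sketch below is one possible way to do that: chunk_bounds is a hypothetical helper name, and a plain list stands in for the DataFrame so the example runs with the standard library only.

```python
from multiprocessing.dummy import Pool as ThreadPool

def chunk_bounds(n_rows, n_workers):
    """Split range(n_rows) into at most n_workers contiguous, non-empty (start, stop) pairs."""
    if n_rows <= 0:
        return []
    size = -(-n_rows // n_workers)  # ceiling division: rows per chunk
    return [(start, min(start + size, n_rows)) for start in range(0, n_rows, size)]

rows = list(range(10))  # stand-in for a DataFrame with 10 rows
bounds = chunk_bounds(len(rows), 4)

def worker(b):
    start, stop = b
    # stand-in for df.iloc[start:stop, :].to_sql(...)
    return rows[start:stop]

pool = ThreadPool(4)
parts = pool.map(worker, bounds)  # map returns results in the order of bounds
pool.close()
pool.join()
```

Rounding the chunk size up means no worker receives an empty slice, and every row is covered exactly once, which is worth asserting before pointing the workers at a real table.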