I have a dataframe with 300,000 rows and 20 columns, many of which contain text. In Excel format this is 30 to 40 MB.
My target is to write this to the database in under 10 minutes. How can I do this?
My question is very similar to this one. However, most of the answers are aimed at Postgres databases, and the second answer, which might work with MS SQL, involves defining the whole table, which I don't want to do because I want reusable code. I also checked other questions in the forum without success.
I have the following three requirements:
From experimenting I found a solution that takes 5 to 6 hours to complete. Below is some code you can use to test my current solution with dummy data; the only thing that needs to be replaced is the url_object.
#For database connection to INFOR LN database
from sqlalchemy import create_engine
from sqlalchemy.engine import URL
#To be able to use dataframes for data transformation
import pandas as pd
#Used for progress bar functionality
from tqdm import tqdm
#For reading JSON files
import json
#For random dataframe creation
import random
import string
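# Note: config.json is expected to contain at least the keys "user" and
# "password" used below (values here are placeholders), e.g.:
# {"user": "my_db_user", "password": "my_db_password"}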
content = open('config.json')
config = json.load(content)
db_user = config['user']
db_password = config['password']
url_object = URL.create(
    "mssql+pyodbc"
    , username=db_user
    , password=db_password
    , host="Server_Name"
    , database="Database"
    , query={"driver": "SQL Server Native Client 11.0"}
)
#Thank you chat GPT.
# Set random seed for reproducibility
random.seed(42)
# Generate random numbers between 0 and 1000000 for 5 columns
num_cols = ['num_col1', 'num_col2', 'num_col3', 'num_col4', 'num_col5']
data = {
    col: [random.randint(0, 1000000) for _ in range(50000)] for col in num_cols
}
# Generate random texts with less than 50 characters for 15 columns
text_cols = ['text_col1', 'text_col2', 'text_col3', 'text_col4', 'text_col5',
             'text_col6', 'text_col7', 'text_col8', 'text_col9', 'text_col10',
             'text_col11', 'text_col12', 'text_col13', 'text_col14', 'text_col15']
for col in text_cols:
    data[col] = [''.join(random.choices(string.ascii_letters + string.digits, k=random.randint(1, 50)))
                 for _ in range(50000)]
# Create DataFrame
df = pd.DataFrame(data)
engine = create_engine(url_object, fast_executemany=True)
df["Python_Script_Excecution_Timestamp"] = pd.Timestamp('now')
def chunker(seq, size):
    # from http://stackoverflow.com/a/434328
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))
def insert_with_progress(df):
    chunksize = 50
    with tqdm(total=len(df)) as pbar:
        for i, cdf in enumerate(chunker(df, chunksize)):
            replace = "replace" if i == 0 else "append"
            cdf.to_sql("Testtable"
                       , engine
                       , schema="dbo"
                       , if_exists=replace
                       , index=False
                       , chunksize=50
                       , method='multi'
                       )
            pbar.update(chunksize)

insert_with_progress(df)
In my specific case I can't increase the chunk size because of an error that gets thrown if I do. The reason is that MS SQL doesn't allow more than 2100 parameters per insert (explanation here); see the rough calculation sketch below the error message.
Error: ('07002', '[07002] [Microsoft][SQL Server Native Client 11.0]COUNT field incorrect or syntax error (0) (SQLExecDirectW)')
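To see why 50 rows per chunk works but larger chunks fail, here is a rough back-of-the-envelope sketch (my own, not taken from the linked explanation): with method='multi', every row in a chunk contributes one parameter per column, so rows times columns has to stay below 2100. The helper below is hypothetical and just illustrates the arithmetic for the 21-column dummy dataframe above (20 columns plus the timestamp column).

MAX_PARAMS = 2100  # SQL Server's limit on parameters per statement

def max_chunksize(n_columns):
    # With method='multi', each row in a chunk adds one parameter per column,
    # so keep rows * columns below the limit.
    return max(1, (MAX_PARAMS - 1) // n_columns)

print(max_chunksize(21))  # 99 rows per INSERT at most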
This question and the answer from Gord Thompson helped :D
to_sql pyodbc count field incorrect or syntax error
The code is also super simple, and it now loads in 6 minutes.
engine = create_engine(url_object, fast_executemany=True)
df.to_sql("Testtable"
, engine
, schema="dbo"
, index=False
, if_exists="replace"
)
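If you still want the progress bar from the original attempt, one possible variation (my own sketch, not part of Gord Thompson's answer) is to keep the chunked loop but drop method='multi', so the 2100-parameter limit no longer applies and much larger chunks can be used. The table name, schema, and chunk size below are just taken from the example above.

def insert_with_progress_fast(df, engine, chunksize=10000):
    with tqdm(total=len(df)) as pbar:
        for pos in range(0, len(df), chunksize):
            chunk = df.iloc[pos:pos + chunksize]
            # Default method: fast_executemany sends one parameterized INSERT
            # per row, so the 2100-parameter limit is not hit.
            chunk.to_sql("Testtable"
                         , engine
                         , schema="dbo"
                         , if_exists="replace" if pos == 0 else "append"
                         , index=False
                         )
            pbar.update(len(chunk))

insert_with_progress_fast(df, engine)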