I have a Pandas dataframe in the form of:
Time    Temperature    Voltage    Current
0.0     7.8            14         56
0.1     7.9            12         58
0.2     7.6            15         55
...and so on for a few hundred thousand rows...
I need to bulk insert the data into a PostgreSQL database, as fast as possible. This is for a Django project, and I'm currently using the ORM for DB operations and building queries, but open to suggestions if there are more efficient ways to accomplish the task.
My data model looks like this:
class Data(models.Model):
    time = models.DateTimeField(db_index=True)
    parameter = models.ForeignKey(Parameter, on_delete=models.CASCADE)
    parameter_value = models.FloatField()
So Time is row[0] of the DataFrame, and then for each header column I grab the value that corresponds to it, using the header as parameter. So row[0] of the example table would generate 3 Data objects in my database:
Data(time=0.0, parameter="Temperature", parameter_value=7.8)
Data(time=0.0, parameter="Voltage", parameter_value=14)
Data(time=0.0, parameter="Current", parameter_value=56)
Our application allows the user to parse data files that are measured in milliseconds. So we generate a LOT of individual data objects from a single file. My current task is to improve the parser to make it much more efficient, until we hit I/O constraints on a hardware level.
My current solution is to go through each row, create one Data object for each time + parameter + value combination, and append said object to an array so I can Data.objects.bulk_create(all_data_objects) through Django. Of course I am aware that this is inefficient and could probably be improved a lot.
Using this code:
# Convert DataFrame to a list of per-row dicts
df_records = df.to_dict('records')

# Start empty data array
all_data_objects = []

# Go through each row, creating objects and appending to the data array
for row in df_records:
    for parameter, parameter_value in row.items():
        if parameter != "Time":
            all_data_objects.append(Data(
                time=row["Time"],
                parameter_value=parameter_value,
                parameter=parameter))

# Commit data to Postgres DB
Data.objects.bulk_create(all_data_objects)
Currently the entire operation, without the DB insert included (that is, just generating the Data objects array, no writing to disk), takes around 370 seconds for a 55 MB file that generates about 6 million individual Data objects. The df_records = df.to_dict('records') line alone takes about 83 seconds. Times were measured using time.time() at both ends of each section and calculating the difference.
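Roughly, the measurement pattern looked like this (a simplified sketch of the timing code, not the full parser):

import time

# Time one section, e.g. the records conversion
start = time.time()
df_records = df.to_dict('records')
end = time.time()
print(f"to_dict('records') took {end - start:.1f} seconds")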
How can I improve these times?
If you really need a fast solution I suggest you dump the table directly using pandas.
First let's create the data for your example:
import pandas as pd

data = {
    'Time': {0: 0.0, 1: 0.1, 2: 0.2},
    'Temperature': {0: 7.8, 1: 7.9, 2: 7.6},
    'Voltage': {0: 14, 1: 12, 2: 15},
    'Current': {0: 56, 1: 58, 2: 55}
}
df = pd.DataFrame(data)
Now you should transform the dataframe so that you have the desired columns, with melt:
df = df.melt(["Time"], var_name="parameter", value_name="parameter_value")
At this point you should map the parameter values to the foreign key ids. I will use params as an example:
params = {"Temperature": 1, "Voltage": 2, "Current": 3}
df["parameter"] = df["parameter"].map(params)
At this point the dataframe will look like:
   Time  parameter  parameter_value
0   0.0          1              7.8
1   0.1          1              7.9
2   0.2          1              7.6
3   0.0          2             14.0
4   0.1          2             12.0
5   0.2          2             15.0
6   0.0          3             56.0
7   0.1          3             58.0
8   0.2          3             55.0
And now to export using pandas you can use:
import sqlalchemy as sa
engine = sa.create_engine("use your connection data")
df.to_sql(name="my_table", con=engine, if_exists="append", index=False)
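If you stay with to_sql, its chunksize and method arguments are also worth experimenting with (method="multi" batches rows into multi-row INSERT statements); the values below are illustrative, not benchmarked:

# Optional to_sql tuning; the chunk size is an arbitrary example value
df.to_sql(name="my_table", con=engine, if_exists="append", index=False,
          method="multi", chunksize=10000)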
However, when I used that it was not fast enough to meet our requirements. So I suggest you use cursor.copy_from instead, since it is faster:
from io import StringIO

# Write the dataframe to an in-memory CSV buffer
output = StringIO()
df.to_csv(output, sep=';', header=False, index=False, columns=df.columns)

# Jump back to the start of the stream
output.seek(0)

# Insert df into Postgres using COPY
connection = engine.raw_connection()
with connection.cursor() as cursor:
    cursor.copy_from(output, "my_table", sep=';', null="NULL", columns=df.columns)
connection.commit()
We tried this with a few million rows and it was the fastest way when using PostgreSQL.
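One caveat with copy_from: the columns written to the CSV must match the actual column names in the target table. With the Django model above, the generated columns would normally be time, parameter_id and parameter_value (and the table name something like yourapp_data), so a rename before exporting may be needed; a sketch, where the exact names are assumptions about your schema:

# Hypothetical rename so the DataFrame columns match the Django-generated
# column names (time, parameter_id, parameter_value)
df = df.rename(columns={"Time": "time", "parameter": "parameter_id"})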