I am doing a lot of ETL with Pandas and Postgres. I have a ton of idle connections, many marked with COMMIT and ROLLBACK, that I am not sure how to prevent from sitting idle for long periods rather than closing. The main code I use to write to the database uses pandas to_sql:
def write_data_frame(self, data_frame, table_name):
    engine = create_engine(self.engine_string)
    data_frame.to_sql(name=table_name, con=engine, if_exists='append', index=False)
I know this is definitely not best practice for PostgreSQL, and I should be doing something like passing params to a stored procedure or function, but this is how we are set up to get data frames from non-Postgres databases / data sources and upload them to Postgres.
My pgAdmin server activity view shows many of these idle connections. (screenshot omitted)
Can someone please point me in the right direction of how to avoid this many idle connections in the future? Some of our database connections are meant to be long-lived as they are continuous "batch" processes. But it seems like some one-off events are leaving connections open and idle.
Creating the engine as a one-off inside the function is probably not ideal for you. If possible, make the engine a member of the class and reference it as self.engine.
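A minimal sketch of that approach (the class name and close() helper here are illustrative, not from your code): the engine, and with it the connection pool, is created once in __init__ and reused by every write, so each call checks a connection out of the pool and returns it instead of opening a fresh one.

```python
import pandas as pd
from sqlalchemy import create_engine

class PostgresWriter:
    def __init__(self, engine_string):
        # One engine (and one connection pool) for the lifetime of the object.
        self.engine = create_engine(engine_string)

    def write_data_frame(self, data_frame, table_name):
        # Reuses pooled connections instead of building a new engine per call.
        data_frame.to_sql(name=table_name, con=self.engine,
                          if_exists='append', index=False)

    def close(self):
        # Dispose of the pool when the writer is no longer needed;
        # this is what actually returns the sockets to Postgres.
        self.engine.dispose()
```

Calling close() once the process is done replaces the many short-lived engines (and their lingering idle connections) with a single pool you control.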
Another option would be to explicitly dispose of the engine.
def write_data_frame(self, data_frame, table_name):
    engine = create_engine(self.engine_string)
    data_frame.to_sql(name=table_name, con=engine, if_exists='append', index=False)
    engine.dispose()
As noted in the docs,
This has the effect of fully closing all currently checked in database connections. Connections that are still checked out will not be closed, however they will no longer be associated with this Engine, so when they are closed individually, eventually the Pool which they are associated with will be garbage collected and they will be closed out fully, if not already closed on checkin.
This may also be a good use case for a try...finally block: as written, .dispose() is only called when the preceding code executes without error, so a failed to_sql would leave the engine undisposed. A finally clause guarantees disposal either way.
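A sketch of that variant, keeping the same method signature as your original (self is assumed to carry engine_string, as in your class):

```python
from sqlalchemy import create_engine

def write_data_frame(self, data_frame, table_name):
    engine = create_engine(self.engine_string)
    try:
        data_frame.to_sql(name=table_name, con=engine,
                          if_exists='append', index=False)
    finally:
        # Runs whether or not to_sql raised, so the engine's pooled
        # connections are released even when the write fails.
        engine.dispose()
```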
I would rather suggest that you pass a connection, like so:
with engine.connect() as connection:
    data_frame.to_sql(..., con=connection)
But the to_sql docs indicate that you can't do that; they will only accept an engine.