I have a CSV file that is constantly getting new data (new rows appended at the end) from another program.
My goal is to read this file in Python and push the incoming data to a SQL database as it arrives. I do not want to wait (or rather, I cannot afford to wait) until the file is finished before pushing the data. Once the file is open, the program should continuously wait for new lines and push them to the DB.
Pandas provides a very useful method, to_sql, that I am currently using to push the data to the database, which is why I'd love to keep using pandas.
Pandas' read_csv method accepts a chunksize argument that I tried to use. The behavior is interesting; however, the file is not reloaded each time a chunk is requested, so it does not solve my problem. In this example,
import time
import pandas as pd

df = pd.read_csv('filename.csv', chunksize=1)
time.sleep(10)
df.get_chunk(5)
if the data in the file is modified during the sleep, the new rows will not be caught when the chunks are queried.
Does anyone have an idea how I could do this?
Thanks in advance
In general: just don't. If you're reading from a file that is held open by another process, you cannot be sure that you're reading valid CSV (because the writing process could be halfway through a line).
If you insist though, there are a few ways to do it. The first would be using watchdog. This Python library makes it possible to "listen" for file changes. As soon as the file has changed (because the other process wrote new lines), you can call pandas.read_csv again.
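A minimal sketch of the watchdog approach could look like this. It assumes the third-party watchdog package is installed (pip install watchdog); the file name test.csv, the handler class name, and the row-tracking logic are my own illustrations, not part of watchdog's API:

```python
import time

import pandas as pd
from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer


class CsvChangeHandler(FileSystemEventHandler):
    """Re-read the CSV whenever the watched file is modified."""

    def __init__(self, path: str):
        self.path = path
        self.n_rows_read = 0

    def on_modified(self, event):
        # Only react to changes of the file we care about
        if event.src_path.endswith(self.path.lstrip("./")):
            # Skip the data rows we already processed; using a range
            # (instead of an int) keeps the header line in place
            df = pd.read_csv(self.path, skiprows=range(1, self.n_rows_read + 1))
            self.n_rows_read += df.shape[0]
            # df.to_sql(...)  # push the new rows to your database here


def watch(path: str, directory: str = ".") -> None:
    """Block forever, re-reading `path` on every modification event."""
    observer = Observer()
    observer.schedule(CsvChangeHandler(path), path=directory, recursive=False)
    observer.start()
    try:
        while True:
            time.sleep(1)
    finally:
        observer.stop()
        observer.join()
```

You would call watch("test.csv") from your main program; the observer runs in a background thread and invokes on_modified for you.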
Another way would be to periodically (e.g. every 10 seconds) check the contents of the file. This is an example of what such a program could look like:
import time

import pandas


def check_updates(n_rows_read: int) -> int:
    """Check for updates in a file, starting from a specific row number.

    Args:
        n_rows_read (int): The number of rows in the file that are already read

    Returns:
        int: The total number of rows that are now processed in the file
    """
    # Read the file, skipping the data rows we processed earlier.
    # A range (instead of a plain int) keeps the header line in place,
    # so the column names stay correct on every re-read.
    df = pandas.read_csv("./test.csv", skiprows=range(1, n_rows_read + 1))
    # TODO: Implement your logic here, and push to your SQL database
    # df.to_sql(...)
    return df.shape[0] + n_rows_read


if __name__ == "__main__":
    # In the beginning we haven't read any rows yet
    n_rows_read = 0
    # Start an infinite loop
    while True:
        # Get updates, and keep track of how many rows of the file are processed
        n_rows_read = check_updates(n_rows_read)
        # Wait for a bit
        time.sleep(10)
NOTE: You will need to modify this code for your purposes
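To fill in the TODO, the new rows can be appended to a SQL table with pandas' to_sql. Here is a sketch using a local SQLite database; the table name "measurements" and the database file are assumptions for illustration:

```python
import sqlite3

import pandas


def push_updates(conn: sqlite3.Connection, n_rows_read: int) -> int:
    """Read any unread rows from the CSV and append them to a SQL table."""
    # Skip the data rows already pushed; the range keeps the header intact
    df = pandas.read_csv("./test.csv", skiprows=range(1, n_rows_read + 1))
    if not df.empty:
        # Append only the new rows to the (hypothetical) "measurements" table
        df.to_sql("measurements", conn, if_exists="append", index=False)
    return df.shape[0] + n_rows_read
```

In the polling loop above you would create the connection once (e.g. conn = sqlite3.connect("data.db")) and call push_updates(conn, n_rows_read) instead of check_updates.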