python pandas jupyter-notebook parquet watchdog

Why is my Python file watcher not writing the data from Parquet files to a data frame?


I have written a file watcher in Python that watches a specific folder on my laptop. Whenever a new Parquet file is created in it, the watcher picks the file up, reads the data inside using pandas, and constructs a data frame from it.

Issue: It does all of this perfectly except the last step, where it has to load the data into the data frame.

Here is the code I have written:

# Imports and declarations

import os
import sys
import time
import pathlib
import pandas as pd
import pyarrow.parquet as pq

from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler, PatternMatchingEventHandler

# Event handler class

class Handler(FileSystemEventHandler):
    
    def on_created(self, event):
        
        # Import Data

        filepath = pathlib.PureWindowsPath(event.src_path).as_posix()
        time.sleep(10) # To allow time to complete file write to disk
        dataset = pd.read_parquet(filepath, engine='pyarrow')
        dataset = dataset.reset_index(drop=True)
        dataset.head()

# Code to run for Python Interpreter

if __name__ == "__main__":
    
    path = r"D:\Folder1\Folder2\Folder3" # Path to watch
    
    observer = Observer()
    event_handler = Handler()
    observer.schedule(event_handler, path, recursive=True)
    observer.start()
    
    try:
        while True:
            time.sleep(1) # Sleep instead of busy-waiting so the loop does not hog a CPU core
            
    except KeyboardInterrupt:
        observer.stop()
        observer.join()

The expected output is the first five rows of the data frame. However, it shows me nothing, and I get no error either.

Some Useful Information

  • I have been running this code in Jupyter Notebook.

  • However, I have also run it in Spyder to see whether a data frame appears at all in its Variable Explorer. It didn't.

From this, the natural conclusion would be that the data frame isn't being created at all. But this is what baffles me, because yesterday I successfully read this same Parquet file with somewhat less sophisticated code (below), where I fed in the file path as a raw string.

# Less Sophisticated Code

filepath = r"D:\Folder1\Folder2\Folder3\filename.parquet"

dataset = pd.read_parquet(filepath, engine='pyarrow')
dataset = dataset.reset_index(drop=True) # Resets index of dataframe and replaces with integers
dataset.head()

Output Screenshot (In Jupyter Notebook)

Is the filepath the issue then? I am very happy to provide any other information you may need.

Edit: I have added a screenshot of the output from the code that did not have a file watcher


Solution

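  • dataset.head() only returns a data frame; it does not print anything. Jupyter automatically displays only the value of the last expression in a cell, so inside on_created the result is silently discarded. (dataset.info(), by contrast, prints its summary itself.) Wrap the call in print():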

    class Handler(FileSystemEventHandler):
        
        def on_created(self, event):
            
            # Import Data
    
            filepath = pathlib.PureWindowsPath(event.src_path).as_posix()
            time.sleep(10) # To allow time to complete file write to disk
            dataset = pd.read_parquet(filepath, engine='pyarrow')
            dataset = dataset.reset_index(drop=True)
            print(dataset.head())  # <- HERE
    

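    Otherwise, your code works for me. Note that dataset is a local variable inside on_created, so it will not appear in Spyder's Variable Explorer either way.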

    Note: prefer Path over PureWindowsPath.
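
To illustrate the note about Path, here is a minimal sketch using only the standard library. PureWindowsPath always parses a string with Windows rules, whatever OS the code runs on, while Path picks the flavour of the running OS, and pd.read_parquet accepts path objects directly, so the as_posix() conversion should not be needed. The helper is_parquet is a hypothetical name introduced here (it is not in the original code); it could be used at the top of on_created to skip folders and non-Parquet files, since the watcher fires for anything created under the watched path.

```python
from pathlib import Path, PureWindowsPath

# Example path taken from the question
raw = r"D:\Folder1\Folder2\Folder3\filename.parquet"

# PureWindowsPath applies Windows parsing rules on every platform
win = PureWindowsPath(raw)
print(win.as_posix())  # D:/Folder1/Folder2/Folder3/filename.parquet
print(win.name)        # filename.parquet

# Path becomes WindowsPath on Windows and PosixPath elsewhere
native = Path(raw)
print(native.suffix)   # .parquet


def is_parquet(src_path):
    """Hypothetical helper: True if the path ends in .parquet (case-insensitive)."""
    return Path(src_path).suffix.lower() == ".parquet"


print(is_parquet(raw))                            # True
print(is_parquet(r"D:\Folder1\Folder2\Folder3"))  # False
```

Inside the handler, this would amount to something like filepath = Path(event.src_path), followed by pd.read_parquet(filepath, engine='pyarrow').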