Search code examples
pysparkparquetpython-polarsfastparquetduckdb

Querying last row of sorted column where value is less than specific amount from parquet file


I have a large parquet file where the data in one of the columns is sorted. A very simplified example is below.

    X   Y
0   1   Red
1   5   Blue
2   8   Green
3   12  Purple
4   15  Blue
5   17  Purple

I am interested in querying the last value of column Y given that X is less than some amount in the most efficient way possible using python.

I am guaranteed that column X is sorted in ascending order.

As an example, given that X is less than 11, I would expect a Y value of "Green".

I have tried the following:

columns='Y'
filters=[('X','<',11]

pd.read_parquet('my_data.parquet',filters=filters,columns=columns).tail(1)

The code above "works" but I am hoping optimizations are possible as this query is run 1M+ times per day.

The parquet file is too large to be read into memory.

I cannot put a starting value for column "X", as there is no guarantee of the size of the gap between values of X. For example, if I were to require "X > 10 and X < 11" I would not get a value for Y returned.

I was hoping given the fact the data is sorted there is a way to optimize this.

I am open to using DuckDB or some other library to do this.


Solution

  • I think that's what .search_sorted() is for.

    You can also use .scan_parquet() to lazy load the data instead of .read_parquet()

    You may need to use when/then to handle the case of the first row being a match - and using index 0 instead of row - 1 - or the case of there being no match (if that's possible.)

    (pl.scan_parquet("search.parquet")
       .select(
          pl.col("Y")
            .take(pl.col("X").search_sorted(11, side="left") - 1)
    ).collect()
    
    shape: (1, 1)
    ┌───────┐
    │ Y     │
    │ ---   │
    │ str   │
    ╞═══════╡
    │ Green │
    └───────┘