I have a large parquet file where the data in one of the columns is sorted. A very simplified example is below.
    X       Y
0   1     Red
1   5    Blue
2   8   Green
3  12  Purple
4  15    Blue
5  17  Purple
Using Python, I want to query the last value of column Y where X is less than some amount, as efficiently as possible.
I am guaranteed that column X is sorted in ascending order.
As an example, given that X is less than 11, I would expect a Y value of "Green".
I have tried the following:
import pandas as pd

columns = ['Y']
filters = [('X', '<', 11)]
pd.read_parquet('my_data.parquet', filters=filters, columns=columns).tail(1)
The code above "works" but I am hoping optimizations are possible as this query is run 1M+ times per day.
The parquet file is too large to be read into memory.
I cannot add a lower bound on column "X", as there is no guarantee of the size of the gaps between values of X. For example, if I were to filter on "X > 10 and X < 11", no value for Y would be returned.
I was hoping given the fact the data is sorted there is a way to optimize this.
I am open to using DuckDB or some other library to do this.
I think that's what .search_sorted() is for. You can also use .scan_parquet() to lazy-load the data instead of .read_parquet().
You may need when/then to handle the case of the first row being a match (using index 0 instead of row - 1), or the case of there being no match at all, if that's possible; see the sketch after the output below.
import polars as pl

(
    pl.scan_parquet("search.parquet")
    .select(
        # index of the first X >= 11, minus 1 = last row with X < 11
        pl.col("Y").take(pl.col("X").search_sorted(11, side="left") - 1)
    )
    .collect()
)
shape: (1, 1)
┌───────┐
│ Y │
│ --- │
│ str │
╞═══════╡
│ Green │
└───────┘
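Here is a minimal sketch of that when/then guard, assuming the same "search.parquet" file: search_sorted(t, side="left") returns the index of the first X >= t, so an index of 0 means no row satisfies X < t, and subtracting 1 would point at the wrong row. (Note that in newer Polars versions, take() has been renamed gather().)

import polars as pl

# Sketch of the edge-case handling: return null when no row matches,
# otherwise take the last row with X < 11.
idx = pl.col("X").search_sorted(11, side="left")
(
    pl.scan_parquet("search.parquet")
    .select(
        pl.when(idx > 0)
        .then(pl.col("Y").take(idx - 1))  # last row with X < 11
        .otherwise(None)                  # no match: return null
        .alias("Y")
    )
    .collect()
)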
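Since the question mentions being open to DuckDB, here is a minimal sketch of the same query there; the assumption is that DuckDB can use the parquet row-group statistics to skip most of the file for the X < 11 predicate, so the whole file never needs to fit in memory.

import duckdb

# Sketch: push the X < 11 filter into the parquet scan and keep only
# the last matching row. Assumes the same 'my_data.parquet' file.
con = duckdb.connect()
row = con.execute(
    """
    SELECT Y
    FROM read_parquet('my_data.parquet')
    WHERE X < ?
    ORDER BY X DESC
    LIMIT 1
    """,
    [11],
).fetchone()
print(row)  # ('Green',)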