I have a large parquet file where the data in one of the columns is sorted. A very simplified example is below.
    X       Y
0   1     Red
1   5    Blue
2   8   Green
3  12  Purple
4  15    Blue
5  17  Purple
Using Python, I want to query the last value of column Y where X is less than some amount, as efficiently as possible.
I am guaranteed that column X is sorted in ascending order.
As an example, given that X is less than 11, I would expect a Y value of "Green".
I have tried the following:
import pandas as pd

columns = ['Y']
filters = [('X', '<', 11)]
pd.read_parquet('my_data.parquet', filters=filters, columns=columns).tail(1)
The code above "works" but I am hoping optimizations are possible as this query is run 1M+ times per day.
The parquet file is too large to be read into memory.
I cannot add a lower bound on column "X", as there is no guarantee of the size of the gaps between values of X. For example, if I were to filter on "X > 10 and X < 11", no value for Y would be returned.
I was hoping given the fact the data is sorted there is a way to optimize this.
I am open to using DuckDB or some other library to do this.
I think that's what .search_sorted() is for. You can also use .scan_parquet() to lazy-load the data instead of .read_parquet().
You may need when/then to handle the case of the first row being a match (using index 0 instead of row - 1), or the case of there being no match at all, if that's possible; see the sketch after the output below.
import polars as pl

(
    pl.scan_parquet("search.parquet")
    .select(
        # index of the first X >= 11, minus 1 = last row with X < 11
        pl.col("Y").take(pl.col("X").search_sorted(11, side="left") - 1)
    )
    .collect()
)
shape: (1, 1)
┌───────┐
│ Y │
│ --- │
│ str │
╞═══════╡
│ Green │
└───────┘
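Here is a minimal sketch of that when/then guard, assuming the same "search.parquet" file: search_sorted(t, side="left") returns the index of the first X >= t, so an index of 0 means no row satisfies X < t, and subtracting 1 would point at the wrong row. (Note that in newer Polars versions, take() has been renamed gather().)

import polars as pl

# Sketch of the edge-case handling: return null when no row matches,
# otherwise take the last row with X < 11.
idx = pl.col("X").search_sorted(11, side="left")
(
    pl.scan_parquet("search.parquet")
    .select(
        pl.when(idx > 0)
        .then(pl.col("Y").take(idx - 1))  # last row with X < 11
        .otherwise(None)                  # no match: return null
        .alias("Y")
    )
    .collect()
)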
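Since the question mentions being open to DuckDB, here is a minimal sketch of the same query there; the assumption is that DuckDB can use the parquet row-group statistics to skip most of the file for the X < 11 predicate, so the whole file never needs to fit in memory.

import duckdb

# Sketch: push the X < 11 filter into the parquet scan and keep only
# the last matching row. Assumes the same 'my_data.parquet' file.
con = duckdb.connect()
row = con.execute(
    """
    SELECT Y
    FROM read_parquet('my_data.parquet')
    WHERE X < ?
    ORDER BY X DESC
    LIMIT 1
    """,
    [11],
).fetchone()
print(row)  # ('Green',)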