I am using Rapids 23.04 and trying to read only selected columns and rows from parquet/orc files. However, strangely, the row filter is not working and I am unable to find the cause. Any help would be greatly appreciated. A proof of concept is given below. Neither dask_cudf nor cudf seems to work:
Python 3.10.11 | packaged by conda-forge | (main, May 10 2023, 18:58:44) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import cudf
>>> import dask_cudf
>>> df = cudf.DataFrame(
... {
... "a": list(range(200)),
... "b": list(reversed(range(200))),
... "c": list(range(200)),
... "d": list(reversed(range(200))),
... }
... )
>>> df
a b c d
0 0 199 0 199
1 1 198 1 198
2 2 197 2 197
3 3 196 3 196
4 4 195 4 195
.. ... ... ... ...
195 195 4 195 4
196 196 3 196 3
197 197 2 197 2
198 198 1 198 1
199 199 0 199 0
[200 rows x 4 columns]
>>> df.to_parquet('test.parquet')
>>> df.to_orc('test.orc')
>>> cudf.read_parquet('test.parquet', columns=['a','c'], filters=[("a", "<", 150)])
a c
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
.. ... ...
195 195 195
196 196 196
197 197 197
198 198 198
199 199 199
[200 rows x 2 columns]
>>> ddf = dask_cudf.read_parquet('test.parquet', columns=['a','c'], filters=[("a", "<", 150)])
>>> ddf.compute()
a c
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
.. ... ...
195 195 195
196 196 196
197 197 197
198 198 198
199 199 199
[200 rows x 2 columns]
>>>
PS: My data size could be very large, hence dask_cudf is more appropriate, though in a few cases cudf could be adequate.
TL;DR: filters= currently filters row groups. Beginning with cuDF version 23.06, filters= will filter individual rows and behave exactly as you expect it to.
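In the current version of cuDF, the filters= argument is used to filter row groups (rather than filtering individual rows). Your 200-row file presumably consists of a single row group, and that group contains rows matching a < 150, so the whole file comes back. This is best explained with an example: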
The following snippet writes a Parquet file with 3 rows per row group:
>>> import pandas as pd
>>> import cudf
>>> df = pd.DataFrame({'a': range(20)})
>>> df.to_parquet('test.parquet', row_group_size=3) # 3 rows per group
Some examples of using the filters= argument with cudf:
>>> cudf.read_parquet('test.parquet', filters=[('a', '=', 3)]) # row group(s) containing a == 3
a
3 3
4 4
5 5
>>> cudf.read_parquet('test.parquet', filters=[('a', '<', 10)]) # row group(s) containing a < 10
a
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
Admittedly this is not very intuitive. In cuDF 23.06, we will change filters= to apply to individual rows, rather than row groups. By curious coincidence, this improvement was merged into cuDF just a few minutes before you raised this question!
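To make the difference concrete, here is a small pure-Python model of the two behaviours. It is only a sketch and not how cuDF is implemented: the helper names (group_may_match, read_row_groups, read_rows) are made up for illustration, and it assumes a reader that decides per row group from min/max statistics whether the group could contain a match.

```python
# Pure-Python model of row-group filtering vs. row filtering.
# Illustrative only: these helpers are hypothetical, not part of cuDF.
from operator import eq, lt

ROW_GROUP_SIZE = 3
values = list(range(20))

# Split the column into row groups, mirroring row_group_size=3 at write time.
row_groups = [
    values[i:i + ROW_GROUP_SIZE]
    for i in range(0, len(values), ROW_GROUP_SIZE)
]


def group_may_match(group, op, target):
    """Decide from min/max statistics alone whether a group could hold a match."""
    lo, hi = min(group), max(group)
    if op == "=":
        return lo <= target <= hi
    if op == "<":
        return lo < target
    raise ValueError(f"unsupported operator: {op}")


def read_row_groups(op, target):
    """Row-group-level filtering: return every row of each group that may match."""
    return [v for g in row_groups if group_may_match(g, op, target) for v in g]


def read_rows(op, target):
    """Row-level filtering: prune groups first, then mask individual rows."""
    keep = {"=": eq, "<": lt}[op]
    return [v for v in read_row_groups(op, target) if keep(v, target)]


print(read_row_groups("=", 3))   # whole group containing 3
print(read_row_groups("<", 10))  # every group with at least one value < 10
print(read_rows("<", 10))        # only the rows that satisfy a < 10
```

Until you can upgrade, the second step of read_rows is what you would have to do yourself: keep filters= for the row-group pruning and then apply an ordinary boolean mask to the result, along the lines of df[df["a"] < 150]. The same mask should work on a dask_cudf DataFrame before calling compute().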