Tags: python-3.x, dask, rapids, cudf

NVIDIA RAPIDS filter neither works nor raises warnings/errors


I am using RAPIDS 23.04 and trying to read selected columns and rows from Parquet/ORC files. However, strangely, the row filter is not working and I am unable to find the cause. Any help would be greatly appreciated. A proof of concept is given below; neither dask_cudf nor cudf seems to work:

Python 3.10.11 | packaged by conda-forge | (main, May 10 2023, 18:58:44) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import cudf
>>> import dask_cudf
>>> df = cudf.DataFrame(
...     {
...         "a": list(range(200)),
...         "b": list(reversed(range(200))),
...         "c": list(range(200)),
...         "d": list(reversed(range(200))),
...     }
... )
>>> df
       a    b    c    d
0      0  199    0  199
1      1  198    1  198
2      2  197    2  197
3      3  196    3  196
4      4  195    4  195
..   ...  ...  ...  ...
195  195    4  195    4
196  196    3  196    3
197  197    2  197    2
198  198    1  198    1
199  199    0  199    0

[200 rows x 4 columns]
>>> df.to_parquet('test.parquet')
>>> df.to_orc('test.orc')
>>> cudf.read_parquet('test.parquet', columns=['a','c'], filters=[("a", "<", 150)])
       a    c
0      0    0
1      1    1
2      2    2
3      3    3
4      4    4
..   ...  ...
195  195  195
196  196  196
197  197  197
198  198  198
199  199  199

[200 rows x 2 columns]
>>> ddf = dask_cudf.read_parquet('test.parquet', columns=['a','c'], filters=[("a", "<", 150)])
>>> ddf.compute()
       a    c
0      0    0
1      1    1
2      2    2
3      3    3
4      4    4
..   ...  ...
195  195  195
196  196  196
197  197  197
198  198  198
199  199  199

[200 rows x 2 columns]
>>>

PS: My data size could be very large, so dask_cudf is the more appropriate choice, though in a few cases cudf could be adequate.


Solution

  • TL;DR: filters= currently filters row groups. Beginning with cuDF version 23.06, filters= will filter individual rows and behave exactly as you expect.

    In the current version of cuDF, the filters= argument is used to filter row groups (rather than individual rows). This is best explained with an example:

    The following snippet writes a Parquet file with 3 rows per row group:

    >>> import pandas as pd
    >>> df = pd.DataFrame({'a': range(20)})
    >>> df.to_parquet('test.parquet', row_group_size=3)  # 3 rows per group
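
    You can confirm the row-group layout by inspecting the file's metadata. A minimal sketch using pyarrow (a cuDF dependency, so it should already be installed); with 20 rows and row_group_size=3 we expect seven groups:

    >>> import pyarrow.parquet as pq
    >>> md = pq.ParquetFile('test.parquet').metadata
    >>> md.num_row_groups  # 20 rows / 3 per group
    7
    >>> [md.row_group(i).num_rows for i in range(md.num_row_groups)]
    [3, 3, 3, 3, 3, 3, 2]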
    

    Some examples of using the filters= argument with cudf:

    >>> cudf.read_parquet('test.parquet', filters=[('a', '=', 3)])  # row group(s) containing a == 3
       a
    3  3
    4  4
    5  5
    >>> cudf.read_parquet('test.parquet', filters=[('a', '<', 10)])  # row group(s) containing a < 10
         a
    0    0
    1    1
    2    2
    3    3
    4    4
    5    5
    6    6
    7    7
    8    8
    9    9
    10  10
    11  11
    

    Admittedly this is not very intuitive. In cuDF 23.06, we will change filters= to apply to individual rows rather than row groups. By curious coincidence, this improvement was merged into cuDF just a few minutes before you raised this question!
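
    Until you can upgrade to 23.06, one workaround is to apply the predicate again after the read: the filters= argument still prunes whole row groups at I/O time, and a boolean mask on the result drops the remaining out-of-range rows. A minimal sketch, reusing the file and column names from the question (the same pattern works lazily with dask_cudf, which suits your large-data case):

    >>> df = cudf.read_parquet('test.parquet', columns=['a', 'c'],
    ...                        filters=[('a', '<', 150)])  # prunes row groups only
    >>> df = df[df['a'] < 150]  # exact row-level filter on what was read
    >>> ddf = dask_cudf.read_parquet('test.parquet', columns=['a', 'c'],
    ...                              filters=[('a', '<', 150)])
    >>> ddf = ddf[ddf['a'] < 150]  # lazy; applied when you call .compute()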