Tags: python, dask, parquet

Trying to filter in dask.read_parquet tries to compare NoneType and str


I have a project where I pass the following load_args to read_parquet:

filters = {'filters': [('itemId', '=', '9403cfde-7fe5-4c9c-916c-41ff0b595c5c')]}

According to the documentation, a List[Tuple] like this should be accepted and I should get all partitions which match the predicate (or equivalently, filter out those that do not).

However, it gives me the following error:

│                                                                                  │
│ /home/user/project/venv/lib/python3.10/site-packages/dask/dataframe/io/parquet/  │
│ core.py:1275 in apply_conjunction                                                │
│                                                                                  │
│   1264 │   for part, stats in zip(parts, statistics):                            │
│   1265 │   │   │   │   if "filter" in stats and stats["filter"]:                 │
│   1266 │   │   │   │   │  continue  # Filtered by engine                         │
│   1267 │   │   │   │   try:                                                      │
│   1268 │   │   │   │   │  c = toolz.groupby("name", stats["columns"])[column][0] │
│   1269 │   │   │   │   │  min = c["min"]                                         │
│   1270 │   │   │   │   │  max = c["max"]                                         │
│   1271 │   │   │   │   except KeyError:                                          │
│   1272 │   │   │   │   │   out_parts.append(part)                                │
│   1273 │   │   │   │   │   out_statistics.append(stats)                          │
│   1274 │   │   │   │   else:                                                     │
│ ❱ 1275 │   │   │   │   │   if (                                                  │
│   1276 │   │   │   │   │   │   operator in ("==", "=")                           │
│   1277 │   │   │   │   │   │   and min <= value <= max                           │
│   1278 │   │   │   │   │   │   or operator == "!="                               │
╰──────────────────────────────────────────────────────────────────────────────────╯
TypeError: '<=' not supported between instances of 'NoneType' and 'str'

It seems that read_parquet compares my filter value against the min and max statistics of the str column I wish to filter on. I'm not sure that is very useful in this case, seeing how the itemId is a random UUID. Even so, str values should be comparable with each other.
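
The error message itself can be reproduced without dask. The following is only a minimal sketch: the helper name `in_range` and the sample bounds are made up for illustration, and it mimics just the `min <= value <= max` check visible in the traceback. It suggests that the comparison fails because the statistics are `None`, not because str values cannot be ordered.

```python
# Illustrative only: mimics the `min <= value <= max` check from the traceback.
def in_range(min_stat, value, max_stat):
    """Return True if value lies within [min_stat, max_stat]."""
    return min_stat <= value <= max_stat


value = "9403cfde-7fe5-4c9c-916c-41ff0b595c5c"

# Strings compare lexicographically, so string statistics work fine.
ok = in_range("00000000", value, "ffffffff")

# A missing statistic (None) reproduces the TypeError from the traceback.
try:
    in_range(None, value, None)
    error = None
except TypeError as exc:
    error = str(exc)

print(ok)
print(error)
```

So the question becomes why the min/max statistics for the column are `None` in the first place.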

Still, I expected this to work. What am I doing wrong?


Solution

  • As discovered by aywandji in the aforementioned GitHub issue, the problem comes from the way dask accesses the min/max metadata.

    The statistics are looked up with an integer (the position of the column in the file), but the position of a given column name can change from one file to another in the same directory, i.e. the filtered column is not at the same position in every file. The lookup can therefore land on a different column.

    A patch is in progress, and we hope it will be included in the next dask release!

    Update from @filpa:

    The fix is included starting with dask 2023.1.1, which was released on 2023-01-28.
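
The positional-lookup problem described above can be illustrated with plain Python. This is a sketch under stated assumptions, not dask's actual code: the per-file statistics lists, the sample values, and the two helper functions are hypothetical, and only the `name`/`min`/`max` keys are taken from the traceback in the question.

```python
# Hypothetical per-file statistics: one entry per column, in file order.
file_a = [
    {"name": "itemId", "min": "0a", "max": "ff"},
    {"name": "price", "min": 1, "max": 9},
]
# Same columns, different order; "price" is assumed to have no statistics here.
file_b = [
    {"name": "price", "min": None, "max": None},
    {"name": "itemId", "min": "0a", "max": "ff"},
]


def stats_by_position(columns, index):
    """Look up min/max by column position (breaks when the order differs)."""
    return columns[index]["min"], columns[index]["max"]


def stats_by_name(columns, name):
    """Look up min/max by column name (independent of column order)."""
    entry = next(c for c in columns if c["name"] == name)
    return entry["min"], entry["max"]


# Position of "itemId" as determined from the first file only.
index = [c["name"] for c in file_a].index("itemId")

wrong = stats_by_position(file_b, index)   # reads the "price" entry
right = stats_by_name(file_b, "itemId")    # reads the "itemId" entry

print(wrong)
print(right)
```

In this sketch the positional lookup returns the statistics of the wrong column for the second file, which is how a `None` can end up being compared with the str filter value.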