I have a file written in the Delta Lake/Parquet format that has a map as one of its columns. The map stores various properties of the row entry in a "property_name": "property_value" format. I'd like to filter on a particular property stored in this map column, preferably before loading the table into memory, using predicate pushdown if available.
This was my attempt to solve the problem by accessing the key-value data through nested fields:
import pyarrow.parquet as pq
import pyarrow as pa
import pandas as pd

data = {'Name': ['Name1', 'Name2', 'Name3'],
        'Trial_Map': [{'a': 'a1', 'b': 'b1'}, {'a': 'a2', 'b': 'b2'}, {'a': 'a1', 'b': 'b3'}]}
df = pd.DataFrame(data)

schema = pa.schema([
    ('Name', pa.string()),
    ('Trial_Map', pa.map_(pa.string(), pa.string()))
])

table = pa.Table.from_pandas(df, schema=schema)
writer = pq.ParquetWriter("example.parquet", table.schema)
writer.write_table(table)
writer.close()
import pyarrow.parquet as pq
import pyarrow.dataset as ds
condition = ((ds.field("Trial_Map", "Trial_Map", "key") == "a") &
             (ds.field("Trial_Map", "Trial_Map", "value") == "a1"))
table = pq.read_table("example.parquet", filters=condition)
print(table.schema)
However, this code gave me the following error:
pyarrow.lib.ArrowNotImplementedError: Function 'struct_field' has no kernel matching input types (map<string, string ('Trial_Map')>)
I would appreciate any help in solving this error/pointing me to any other methods of performing this pre-loading filter. Thanks for your time!
PyArrow currently doesn't support directly selecting the values for a certain key using a nested field reference (as you were trying with ds.field("Trial_Map", "key")), but there is a compute function that allows selecting those values: "map_lookup".
If we can assume that each key occurs only once in each map element (i.e. no duplicates per row), we can define a filter like this:
import pyarrow.compute as pc
map_filter = pc.map_lookup(pc.field("Trial_Map"), pa.scalar("a"), "first") == "a1"
Using it on your example, you can see it working in practice:
>>> pq.read_table("example.parquet").to_pandas()
Name Trial_Map
0 Name1 [(a, a1), (b, b1)]
1 Name2 [(a, a2), (b, b2)]
2 Name3 [(a, a1), (b, b3)]
>>> pq.read_table("example.parquet", filters=map_filter).to_pandas()
Name Trial_Map
0 Name1 [(a, a1), (b, b1)]
1 Name3 [(a, a1), (b, b3)]
The reason this only works when there is a single "a" key per map element is that I have to specify "first" to get the first value for key "a" in each element. If there can be multiple occurrences, you can specify "all" instead, but then the result is a ListArray.
Applying the compute function directly on the table to illustrate this:
>>> pc.map_lookup(table["Trial_Map"].chunk(0), pa.scalar("a"), "first")
<pyarrow.lib.StringArray object at 0x7f099d2b8040>
[
"a1",
"a2",
"a1"
]
>>> pc.map_lookup(table["Trial_Map"].chunk(0), pa.scalar("a"), "all")
<pyarrow.lib.ListArray object at 0x7f099d3134c0>
[
[
"a1"
],
[
"a2"
],
[
"a1"
]
]
And then if we have this ListArray, the element-wise equality == "a1" doesn't work out of the box (there is an open enhancement request to add a function that checks whether a list contains some value: https://github.com/apache/arrow/issues/33295).