Tags: python-polars, rust-polars

In Python Polars, convert a JSON string column to a dict for filtering


Hi, I have a dataframe with a column called tags, which is a JSON string.

I want to filter this dataframe on the tags column so it only contains rows where a certain tag key is present or where a tag has a particular value.

I guess I could do a string-contains match, but I think it may be more robust to convert the JSON into a dict first and use has_key, etc.?

What would be the recommended way to do this in Python Polars?

Thanks


Solution

  • Polars does not have a generic dictionary type. Instead, dictionaries are imported/mapped as structs. Each dictionary key is mapped to a struct 'field name', and the corresponding dictionary value becomes the value of this field.

    However, there are some constraints for creating a Series of type struct. Two of them are:

    • all structs must have the same field names.
    • the field names must be listed in the same order.

    In your description, you mention has_key, which indicates that the dictionaries will not have the same keys. As such, creating a column of struct from your dictionaries will not work. (For more information, you can see this Stack Overflow response.)
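
    For contrast, here is a brief sketch (not from the original answer, with made-up values) of the case that does work: when every dictionary has the same keys in the same order, Polars maps them onto a single Struct column, and individual fields can be pulled out with .struct.field.

    import polars as pl
    
    # Dictionaries sharing the same keys, in the same order,
    # become one Struct column.
    df_struct = pl.DataFrame(
        {
            "tags": [
                {"name": "Maria", "position": "developer"},
                {"name": "Josh", "position": "analyst"},
            ]
        }
    )
    
    df_struct.schema  # 'tags' is a Struct with fields 'name' and 'position'
    df_struct.select(
        pl.col("tags").struct.field("position")  # extract a single field
    )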

    json_path_match

    I suggest using json_path_match, which extracts values based on some simple JSONPath syntax. With it, you should be able to query whether a key exists and retrieve its value. (For simple unnested dictionaries, these are the same query.)

    For example, let's start with this data:

    import polars as pl
    
    json_list = [
        """{"name": "Maria",
            "position": "developer",
            "office": "Seattle"}""",
        """{"name": "Josh",
            "position": "analyst",
            "termination_date": "2020-01-01"}""",
        """{"name": "Jorge",
            "position": "architect",
            "office": "",
            "manager_st_dt": "2020-01-01"}""",
    ]
    
    df = pl.DataFrame(
        {
            "tags": json_list,
        }
    ).with_row_index("id", 1)
    df
    
    shape: (3, 2)
    ┌─────┬───────────────────────────────────────────┐
    │ id  ┆ tags                                      │
    │ --- ┆ ---                                       │
    │ u32 ┆ str                                       │
    ╞═════╪═══════════════════════════════════════════╡
    │ 1   ┆ {"name": "Maria",                         │
    │     ┆         "position": "developer",          │
    │     ┆         "office": "Seattle"}              │
    │ 2   ┆ {"name": "Josh",                          │
    │     ┆         "position": "analyst",            │
    │     ┆         "termination_date": "2020-01-01"} │
    │ 3   ┆ {"name": "Jorge",                         │
    │     ┆         "position": "architect",          │
    │     ┆         "office": "",                     │
    │     ┆         "manager_st_dt": "2…              │
    └─────┴───────────────────────────────────────────┘
    

    To query for values:

    df.with_columns(
        pl.col("tags").str.json_path_match(r"$.name").alias("name"),
        pl.col("tags").str.json_path_match(r"$.office").alias("location"),
        pl.col("tags").str.json_path_match(r"$.manager_st_dt").alias("manager start date"),
    )
    
    shape: (3, 5)
    ┌─────┬───────────────────────────────────────────┬───────┬──────────┬────────────────────┐
    │ id  ┆ tags                                      ┆ name  ┆ location ┆ manager start date │
    │ --- ┆ ---                                       ┆ ---   ┆ ---      ┆ ---                │
    │ u32 ┆ str                                       ┆ str   ┆ str      ┆ str                │
    ╞═════╪═══════════════════════════════════════════╪═══════╪══════════╪════════════════════╡
    │ 1   ┆ {"name": "Maria",                         ┆ Maria ┆ Seattle  ┆ null               │
    │     ┆         "position": "developer",          ┆       ┆          ┆                    │
    │     ┆         "office": "Seattle"}              ┆       ┆          ┆                    │
    │ 2   ┆ {"name": "Josh",                          ┆ Josh  ┆ null     ┆ null               │
    │     ┆         "position": "analyst",            ┆       ┆          ┆                    │
    │     ┆         "termination_date": "2020-01-01"} ┆       ┆          ┆                    │
    │ 3   ┆ {"name": "Jorge",                         ┆ Jorge ┆          ┆ 2020-01-01         │
    │     ┆         "position": "architect",          ┆       ┆          ┆                    │
    │     ┆         "office": "",                     ┆       ┆          ┆                    │
    │     ┆         "manager_st_dt": "2…              ┆       ┆          ┆                    │
    └─────┴───────────────────────────────────────────┴───────┴──────────┴────────────────────┘
    
    

    Notice the null values. This is the return value when a key is not found. We'll use this fact for the has_key functionality you mentioned.

    Also, if we look at the "location" column, we can see that json_path_match does distinguish between an empty string ("office": "") and a key that is not found.

    To filter for the presence of a key, we simply keep the rows where the result is not null.

    df.filter(
        pl.col("tags").str.json_path_match(r"$.manager_st_dt").is_not_null()
    )
    
    shape: (1, 2)
    ┌─────┬───────────────────┐
    │ id  ┆ tags              │
    │ --- ┆ ---               │
    │ u32 ┆ str               │
    ╞═════╪═══════════════════╡
    │ 3   ┆ {"name": "Jorge", │
    │     ┆         "posit... │
    └─────┴───────────────────┘
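
    The question also asks about filtering on a particular tag value. Since json_path_match returns the value as a string (or null when the key is missing), you can compare it directly. A quick sketch (not in the original answer), reusing the sample data:

    df.filter(
        pl.col("tags").str.json_path_match(r"$.position") == "developer"
    )
    # keeps only Maria's row (id 1); rows where the key is missing yield
    # null in the comparison and are dropped by the filter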
    

    json_path_match will also work with nested structures. (See the Syntax page for details.)
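
    For example, a nested key can be reached with a dotted path. This is a sketch with made-up nested data (assuming a standard dotted JSONPath such as $.office.city):

    nested = pl.DataFrame(
        {"tags": ['{"name": "Maria", "office": {"city": "Seattle", "floor": 4}}']}
    )
    
    nested.with_columns(
        pl.col("tags").str.json_path_match(r"$.office.city").alias("city")
    )
    # the new "city" column contains "Seattle"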

    One limitation, however: json_path_match will only return the first match for a query, rather than a list of matches. If your JSON strings are not lists or nested dictionaries, this won't be a problem.