Tags: python-polars, rust-polars

In Python Polars, convert a JSON string column to a dict for filtering


Hi, I have a dataframe with a column called tags, which is a JSON string.

I want to filter this dataframe on the tags column so it only contains rows where a certain tag key is present or where a tag has a particular value.

I guess I could do a string-contains match, but I think it may be more robust to convert the JSON into a dict first and use has_key, etc.?

What would be the recommended way to do this in Python Polars?

Thanks


Solution

  • Polars does not have a generic dictionary type. Instead, dictionaries are imported/mapped as structs. Each dictionary key is mapped to a struct 'field name', and the corresponding dictionary value becomes the value of this field.

    However, there are some constraints for creating a Series of type struct. Two of them are:

    • all structs must have the same field names.
    • the field names must be listed in the same order.

    In your description, you mention has_key, which indicates that the dictionaries will not have the same keys. As such, creating a column of struct from your dictionaries will not work. (For more information, you can see this Stack Overflow response.)
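
    For contrast, here is a brief sketch (not from the original answer, with made-up values) of the case that does work: when every dictionary has the same keys in the same order, Polars maps them onto a single Struct column, and individual fields can be pulled out with .struct.field.

    import polars as pl
    
    # Dictionaries sharing the same keys, in the same order,
    # become one Struct column.
    df_struct = pl.DataFrame(
        {
            "tags": [
                {"name": "Maria", "position": "developer"},
                {"name": "Josh", "position": "analyst"},
            ]
        }
    )
    
    df_struct.schema  # 'tags' is a Struct with fields 'name' and 'position'
    df_struct.select(
        pl.col("tags").struct.field("position")  # extract a single field
    )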

    json_path_match

    I suggest using json_path_match, which extracts values based on some simple JSONPath syntax. With it, you should be able to query whether a key exists and retrieve its value. (For simple unnested dictionaries, these are the same query.)

    For example, let's start with this data:

    import polars as pl
    
    json_list = [
        """{"name": "Maria",
            "position": "developer",
            "office": "Seattle"}""",
        """{"name": "Josh",
            "position": "analyst",
            "termination_date": "2020-01-01"}""",
        """{"name": "Jorge",
            "position": "architect",
            "office": "",
            "manager_st_dt": "2020-01-01"}""",
    ]
    
    df = pl.DataFrame(
        {
            "tags": json_list,
        }
    ).with_row_index("id", 1)
    df
    
    shape: (3, 2)
    ┌─────┬───────────────────────────────────────────┐
    │ id  ┆ tags                                      │
    │ --- ┆ ---                                       │
    │ u32 ┆ str                                       │
    ╞═════╪═══════════════════════════════════════════╡
    │ 1   ┆ {"name": "Maria",                         │
    │     ┆         "position": "developer",          │
    │     ┆         "office": "Seattle"}              │
    │ 2   ┆ {"name": "Josh",                          │
    │     ┆         "position": "analyst",            │
    │     ┆         "termination_date": "2020-01-01"} │
    │ 3   ┆ {"name": "Jorge",                         │
    │     ┆         "position": "architect",          │
    │     ┆         "office": "",                     │
    │     ┆         "manager_st_dt": "2…              │
    └─────┴───────────────────────────────────────────┘
    

    To query for values:

    df.with_columns(
        pl.col("tags").str.json_path_match(r"$.name").alias("name"),
        pl.col("tags").str.json_path_match(r"$.office").alias("location"),
        pl.col("tags").str.json_path_match(r"$.manager_st_dt").alias("manager start date"),
    )
    
    shape: (3, 5)
    ┌─────┬───────────────────────────────────────────┬───────┬──────────┬────────────────────┐
    │ id  ┆ tags                                      ┆ name  ┆ location ┆ manager start date │
    │ --- ┆ ---                                       ┆ ---   ┆ ---      ┆ ---                │
    │ u32 ┆ str                                       ┆ str   ┆ str      ┆ str                │
    ╞═════╪═══════════════════════════════════════════╪═══════╪══════════╪════════════════════╡
    │ 1   ┆ {"name": "Maria",                         ┆ Maria ┆ Seattle  ┆ null               │
    │     ┆         "position": "developer",          ┆       ┆          ┆                    │
    │     ┆         "office": "Seattle"}              ┆       ┆          ┆                    │
    │ 2   ┆ {"name": "Josh",                          ┆ Josh  ┆ null     ┆ null               │
    │     ┆         "position": "analyst",            ┆       ┆          ┆                    │
    │     ┆         "termination_date": "2020-01-01"} ┆       ┆          ┆                    │
    │ 3   ┆ {"name": "Jorge",                         ┆ Jorge ┆          ┆ 2020-01-01         │
    │     ┆         "position": "architect",          ┆       ┆          ┆                    │
    │     ┆         "office": "",                     ┆       ┆          ┆                    │
    │     ┆         "manager_st_dt": "2…              ┆       ┆          ┆                    │
    └─────┴───────────────────────────────────────────┴───────┴──────────┴────────────────────┘
    
    

    Notice the null values. This is the return value when a key is not found. We'll use this fact for the has_key functionality you mentioned.

    Also, if we look at the "location" column, we can see that json_path_match does distinguish between an empty string ("office": "") and a key that is not found.

    To filter for the presence of a key, we simply keep the rows where the result is not null.

    df.filter(
        pl.col("tags").str.json_path_match(r"$.manager_st_dt").is_not_null()
    )
    
    shape: (1, 2)
    ┌─────┬───────────────────┐
    │ id  ┆ tags              │
    │ --- ┆ ---               │
    │ u32 ┆ str               │
    ╞═════╪═══════════════════╡
    │ 3   ┆ {"name": "Jorge", │
    │     ┆         "posit... │
    └─────┴───────────────────┘
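
    The question also asks about filtering on a particular tag value. Since json_path_match returns the value as a string (or null when the key is missing), you can compare it directly. A quick sketch (not in the original answer), reusing the sample data:

    df.filter(
        pl.col("tags").str.json_path_match(r"$.position") == "developer"
    )
    # keeps only Maria's row (id 1); rows where the key is missing yield
    # null in the comparison and are dropped by the filter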
    

    json_path_match will also work with nested structures. (See the Syntax page for details.)
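
    For example, a nested key can be reached with a dotted path. This is a sketch with made-up nested data (assuming a standard dotted JSONPath such as $.office.city):

    nested = pl.DataFrame(
        {"tags": ['{"name": "Maria", "office": {"city": "Seattle", "floor": 4}}']}
    )
    
    nested.with_columns(
        pl.col("tags").str.json_path_match(r"$.office.city").alias("city")
    )
    # the new "city" column contains "Seattle"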

    One limitation, however: json_path_match will only return the first match for a query, rather than a list of matches. If your JSON strings are not lists or nested dictionaries, this won't be a problem.