Search code examples
pythondataframepython-polars

How to propagate `null` in a column after first occurrence?


I have 2 data sets:

The first one describes what I expect:

expected = {
    "name": ["start", "stop", "start", "stop", "start", "stop", "start", "stop"],
    "description": ["a", "b", "c", "d", "e", "f", "g", "h"],
}

and the second one describes what I observe:

observed = {
    "name": ["start", "stop", "start", "stop", "stop", "stop", "start"],
    "time": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7],
}

I want to match all my observations to descriptions based on the order I expect. But once I see an inconsistency, nothing should match anymore.

I managed to find the first inconsistency like:

observed_df = pl.DataFrame(observed).with_row_index()
expected_df = pl.DataFrame(expected).with_row_index()

result = observed_df.join(expected_df, on=["index", "name"], how="left").select(
    "description", "time"
)
"""
┌─────────────┬──────┐
│ description ┆ time │
│ ---         ┆ ---  │
│ str         ┆ f64  │
╞═════════════╪══════╡
│ a           ┆ 0.1  │
│ b           ┆ 0.2  │
│ c           ┆ 0.3  │
│ d           ┆ 0.4  │
│ null        ┆ 0.5  │   -> First inconsistency gets a "null" description
│ f           ┆ 0.6  │
│ g           ┆ 0.7  │
└─────────────┴──────┘
"""

How can I propagate this null passed the first inconsistency?

Also, my real data has an additional id column, where each id is a case like described above, and independent from other ids. Is it possible to somehow "group by id" and apply this logic all at once instead of working with each id separately:

observed = {
    "id": [1, 2, 1, 2, 2],
    "name": ["start", "start", "stop", "stop", "stop"],
    "time": [0.1, 0.2, 0.3, 0.4, 0.5],
}

expected = {
    "id": [1, 1, 2, 2],
    "name": ["start", "stop", "start", "stop"],
    "description": ["a", "b", "c", "d"],
}

result = {
    "id": [1, 2, 1, 2, 2],
    "description": ["a", "c", "b", "d", None],
    "time": [0.1, 0.2, 0.3, 0.4, 0.5],
}

Solution

  • The check whether any null value appeared in an increasing window can be done using a cumulative evaluation, such as pl.Expr.cum_sum.

    A when-then-otherwise construct can be used to propagate null values accordingly.

    In your example, this might look as follows.

    (
        observed_df
        .join(
            expected_df,
            on=["index", "name"],
            how="left",
        )
        .select("description", "time")
        .with_columns(
            pl.when(
                pl.col("description").is_null().cum_sum() == 0
            ).then(
                "description"
            )
        )
    )
    
    shape: (7, 2)
    ┌─────────────┬──────┐
    │ description ┆ time │
    │ ---         ┆ ---  │
    │ str         ┆ f64  │
    ╞═════════════╪══════╡
    │ a           ┆ 0.1  │
    │ b           ┆ 0.2  │
    │ c           ┆ 0.3  │
    │ d           ┆ 0.4  │
    │ null        ┆ 0.5  │
    │ null        ┆ 0.6  │
    │ null        ┆ 0.7  │
    └─────────────┴──────┘
    

    If you'd like to evaluate this expression separately for each group defined by id, a window function, such as pl.Expr.over, might be used.

    ...
            pl.when(
                pl.col("description").is_null().cum_sum() == 0
            ).then(
                "description"
            ).over("id") # <--
    ...