
Fill NaN values in Polars using a custom-defined function for a specific column


I have this code in pandas:

df[col] = (
            df[col]
            .fillna(method="ffill", limit=1)
            .apply(lambda x: my_function(x))
        )

I want to re-write this in Polars.

I have tried this:

df = df.with_columns(
            pl.col(col)
            .fill_null(strategy="forward", limit=1)
            .map_elements(lambda x: my_function(x))
        )

This does not behave the same way. The forward fill is applied, but the remaining missing values are never passed to my function, so they stay null. What should I change in my code to get the pandas behavior?

Here is a reproducible example:

import numpy as np
import pandas as pd
import polars as pl

df_polars = pl.DataFrame(
    {"A": [1, 2, None, None, None, None, 4, None], "B": [5, None, None, None, None, 7, None, 9]}
)

df_pandas = pd.DataFrame(
    {"A": [1, 2, None, None, None, None, 4, None], "B": [5, None, None, None, None, 7, None, 9]}
)

last_valid_data: int


def my_function(x):
    global last_valid_data
    if x is None or np.isnan(x):
        result = last_valid_data * 10
    else:
        last_valid_data = x
        result = x
    return result


col = "A"

last_valid_data = df_pandas[col][0]
df_pandas[col] = df_pandas[col].fillna(method="ffill", limit=1).apply(lambda x: my_function(x))

last_valid_data = df_polars[col][0]
df_polars = df_polars.with_columns(
    pl.col(col).fill_null(strategy="forward", limit=1).map_elements(lambda x: my_function(x))
)

The desired output, which pandas produces, is:

      A    B
0   1.0  5.0
1   2.0  NaN
2   2.0  NaN
3  20.0  NaN
4  20.0  NaN
5  20.0  7.0
6   4.0  NaN
7   4.0  9.0

What I get in Polars is:

┌──────┬──────┐
│ A    ┆ B    │
│ ---  ┆ ---  │
│ i64  ┆ i64  │
╞══════╪══════╡
│ 1    ┆ 5    │
│ 2    ┆ null │
│ 2    ┆ null │
│ null ┆ null │
│ null ┆ null │
│ null ┆ 7    │
│ 4    ┆ null │
│ 4    ┆ 9    │
└──────┴──────┘

Solution

  • The issue is that in Polars, .map_elements defaults to skip_nulls=True, so the function is never called for null values:

    df_polars.with_columns(
       pl.col('A').map_elements(lambda me: print(f'{me=}'))
    )
    
    me=1
    me=2
    me=4
    

    Because your function specifically needs to handle the nulls, set skip_nulls=False so they are passed to it:

    df_polars.with_columns(
       pl.col('A').map_elements(lambda me: print(f'{me=}'), skip_nulls=False)
    )
    
    me=1
    me=2
    me=None
    me=None
    me=None
    me=None
    me=4
    me=None