I have this pandas code:
import numpy as np
import pandas as pd

def last_non_null(s):
    # Return the last non-null value of the series, or NaN if there is none
    non_null = s.dropna()
    return non_null.iloc[-1] if not non_null.empty else np.nan

def merge_rows_of_final_df(df_final):
    # Group by columns A, B, and C
    cols = ['A', 'B', 'C']
    # Apply the function to each column not in the subset
    # NOTE: the other columns have np.float32 as dtype
    agg_dict = {col: last_non_null for col in df_final.columns.difference(cols)}
    # Group and aggregate
    return df_final.groupby(cols).agg(agg_dict).reset_index()

df_merged = merge_rows_of_final_df(df_final)
Can I optimize this with polars? If so, could someone provide a polars snippet with the same output? The idea is that df_final is the result of concatenating a list of dataframes that have duplicates on the subset columns A, B, C. I want to merge the rows of each group by taking the last/first non-null value of each column.
Example input:
df = pd.DataFrame({
    "A": [1, 1, 1, 1, 4, 4, 4],
    "B": [2, 2, 2, 2, 5, 5, 5],
    "C": [3, 3, 3, 3, 6, 6, 6],
    "D": [4, np.nan, np.nan, np.nan, 1, np.nan, np.nan],
    "E": [np.nan, np.nan, np.nan, 7, np.nan, np.nan, 2],
    "F": [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
})
Example output:
A B C D E F
0 1 2 3 4.0 7.0 NaN
1 4 5 6 1.0 2.0 NaN
You can drop_nulls() and then take the last():
import polars as pl

df = pl.DataFrame({
    "A": [1, 1, 1, 1, 4, 4, 4],
    "B": [2, 2, 2, 2, 5, 5, 5],
    "C": [3, 3, 3, 3, 6, 6, 6],
    "D": [4, None, None, None, 1, None, None],
    "E": [None, None, None, 7, None, None, 2],
    "F": [None, None, None, None, None, None, None],
})
df.group_by("A", "B", "C").agg(
    pl.all().drop_nulls().last()
)
shape: (2, 6)
┌─────┬─────┬─────┬─────┬─────┬──────┐
│ A ┆ B ┆ C ┆ D ┆ E ┆ F │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ null │
╞═════╪═════╪═════╪═════╪═════╪══════╡
│ 1 ┆ 2 ┆ 3 ┆ 4 ┆ 7 ┆ null │
│ 4 ┆ 5 ┆ 6 ┆ 1 ┆ 2 ┆ null │
└─────┴─────┴─────┴─────┴─────┴──────┘