I want to split a dataset with ~ 1 million rows based on a multiple occurring ID and a mode type (active, not active). When splitting, missing values in numeric columns should be interpolated and categorical values should be filled with ffill. Finally, the remaining zero values are to be dropped. For this I have written the following function:
objectList = list(df_sorted.select_dtypes(include=["O", "datetime64[ns]"]).columns)
floatList = list(df_sorted.select_dtypes(include=["float64"]).columns)
def fill_missing_values(df_group):
df_group[objectList] = df_group[objectList].ffill()
df_group[floatList] = df_group[floatList].interpolate(
method="linear", limit_direction="forward"
)
df_group.dropna()
return df_group
The function is now to be applied as follows:
df_nn = df_sorted.groupby(["ID", "Mode"]).apply(
lambda df_sorted: fill_missing_values(df_sorted)
)
The cell is executed without errors, but the output takes way too long. So my question is: Is this approach overall correct or am I missing something? And how can this code get more performance?
Input Data
df = pd.DataFrame(
{
"ID": ["0A", "0A", "0A", "0A", "0A", "1C", "1C", "1C", "1C"],
"MODE": [
"active",
"active",
"active",
"inactive",
"inactive",
"active",
"active",
"active",
"inactive",
],
"Signal1 ": [13, np.nan, 4, 11, np.nan, 22, 25, np.nan, 19],
"Signal2 ": [np.nan, 0.1, 0.3, "NaN", 4.5, "NaN", 2.0, 3.0, np.nan],
"Signal3 ": ["on", np.nan, np.nan, "off", np.nan, "on", np.nan, "on", np.nan],
}
)
df
ID MODE Signal1 Signal2 Signal3
0 0A active 13 NaN on
1 0A active NaN 0.1 NaN
2 0A active 4 0.3 NaN
3 0A inactive 11 NaN off
4 0A inactive NaN 4.5 NaN
5 1C active 22 NaN on
6 1C active 25 2.0 NaN
7 1C active NaN 3.0 on
8 1C inactive 19 NaN NaN
Desired Output after ffill and interpolation of ID "0A":
ID MODE Signal1 Signal2 Signal3
0 0A active 13.0 NaN on
1 0A active 8.5 0.1 on
2 0A active 4.0 0.3 on
3 0A inactive 11.0 NaN off
4 0A inactive 11.0 4.5 off
Desired Output after dropna of ID "0A":
ID MODE Signal1 Signal2 Signal3
0 0A active 8.5 0.1 on
1 0A active 4.0 0.3 on
ID MODE Signal1 Signal2 Signal3
0 0A inactive 11 4.5 off
IIUC, you want:
groupby
the ID and MODE columns and interpolate all numeric columnsgroupby
the ID and MODE columns and ffill all non-numeric columnsimport numpy as np
#replace string "NaN" with numpy.nan
df = df.replace("NaN", np.nan)
numeric = df.filter(like="Signal").select_dtypes(np.number).columns
others = df.filter(like="Signal").select_dtypes(None,np.number).columns
df[numeric] = df.groupby(["ID", "MODE"])[numeric].transform(pd.Series.interpolate, limit_direction="forward")
df[others] = df.groupby(["ID", "MODE"])[others].transform("ffill")
>>> df
ID MODE Signal1 Signal2 Signal3
0 0A active 13.0 NaN on
1 0A active 8.5 0.1 on
2 0A active 4.0 0.3 on
3 0A inactive 11.0 NaN off
4 0A inactive 11.0 4.5 off
5 1C active 22.0 NaN on
6 1C active 25.0 2.0 on
7 1C active 25.0 3.0 on
8 1C inactive 19.0 NaN NaN
>>> df.dropna()
ID MODE Signal1 Signal2 Signal3
1 0A active 8.5 0.1 on
2 0A active 4.0 0.3 on
4 0A inactive 11.0 4.5 off
6 1C active 25.0 2.0 on
7 1C active 25.0 3.0 on