Say I have a dataframe df with P columns, where data can be missing at different rows for different columns, e.g. the first row may have data for column 1 but not for column 2, and potentially vice-versa for other rows. For each column separately, I want to select the data between its first and last valid index (which, again, can differ across columns) and check whether any NaNs remain in that range. I would then like to exclude the columns that still contain NaNs.
My current code works like this:
import numpy as np

good_data = []  # columns with no NaNs between their first and last valid index
for i in df.columns:
    df_i = df[i]
    # trim to the range between the column's first and last valid index
    trimmed_i = df_i.loc[df_i.first_valid_index():df_i.last_valid_index()]
    # keep the column only if no NaNs remain inside that range
    if not np.any(trimmed_i.isnull()):
        good_data.append(i)
df = df.loc[:, good_data]
The problem is that the loop is slow if I have many columns. Is there a more efficient way to do it, maybe avoiding loops?
Assuming an input like this, in which you want to remove "C":
A B C
0 NaN NaN NaN
1 1.0 1.0 1.0
2 2.0 2.0 NaN
3 3.0 3.0 3.0
4 NaN 4.0 NaN
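For reference, this example frame can be reconstructed like so (values taken from the printout above):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [np.nan, 1.0, 2.0, 3.0, np.nan],
    "B": [np.nan, 1.0, 2.0, 3.0, 4.0],
    "C": [np.nan, 1.0, np.nan, 3.0, np.nan],
})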
You can remove the columns with inner NaNs using notna, a double cummax to identify the inner NaNs, then boolean indexing with all to select the columns that have no inner NaNs:
tmp = df.notna()
out = df.loc[:, tmp.eq(tmp.cummax() & tmp[::-1].cummax()).all()]
Output:
A B
0 NaN NaN
1 1.0 1.0
2 2.0 2.0
3 3.0 3.0
4 NaN 4.0
Intermediates:
# flag non-NaN values
# tmp = df.notna()
A B C
0 False False False
1 True True True
2 True True False
3 True True True
4 False True False
# True between first and last valid index (inner NaN positions filled)
# tmp.cummax() & tmp[::-1].cummax()
A B C
0 False False False
1 True True True
2 True True True
3 True True True
4 False True False
# identifying inner NaNs (as False)
# tmp.eq(tmp.cummax() & tmp[::-1].cummax())
A B C
0 True True True
1 True True True
2 True True False
3 True True True
4 True True True
# columns to keep (no inner NaNs)
# tmp.eq(tmp.cummax() & tmp[::-1].cummax()).all()
A True
B True
C False
dtype: bool
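As a side note (not part of the original approach), an equivalent mask of the span between first and last valid index can be built with ffill/bfill instead of the double cummax, if you find that easier to read; a minimal sketch:

# True between each column's first and last valid index
span = df.ffill().notna() & df.bfill().notna()
# keep only the columns whose non-NaN mask matches the span, i.e. no inner NaNs
out = df.loc[:, df.notna().eq(span).all()]

This does the same thing as the cummax version (ffill/bfill propagate valid values over the inner NaNs), just trading the reversed cummax for a backward fill.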