I have a series s whose entries are lists, for example [1, 2, 3, NaN, NaN] or [4, 5]. These lists may contain NaNs as the last few elements, and I want to drop all entries in this series that contain NaN. So far I have used s.transform(lambda x: np.nan if np.isnan(x).any() else x).dropna(), but this takes over a minute on just 21 million rows, and I eventually plan to do this with tens of billions of rows, so I need something fast. Thank you!
To emphasize: each entry in the series is a list, so I cannot just use s.dropna(), because no entry is itself NaN; the entries are lists. I want to delete the lists (entries) that CONTAIN NaN. This is what the series s might look like: pd.Series([[1, 2, 3, np.nan, np.nan], [4, 5], ...]).
You can explode the series, identify the index positions whose values are NaN, and then filter the original series down to the indices that are not in that set (explode repeats each list's index once per element, so a NaN in the exploded series points back at the list that contained it):
import numpy as np
import pandas as pd

ser = pd.Series([[1, 2, 3, np.nan, np.nan], [3, 4, 5], [3, 9], [np.nan, 10]], name="col")
ser_exploded = ser.explode()  # one row per list element; original index repeated per list
bad_idx = ser_exploded[ser_exploded.isna()].index.unique()  # indices of lists containing NaN
ser[~ser.index.isin(bad_idx)]
--------------------------------------
1 [3, 4, 5]
2 [3, 9]
Name: col, dtype: object
--------------------------------------
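Given the scale mentioned in the question, a plain boolean mask built with a list comprehension may also be worth benchmarking, since it avoids materializing the exploded intermediate series. This is a minimal sketch, not a guaranteed speedup; it relies on the standard NaN check that NaN is the only value not equal to itself (v != v), which also sidesteps np.isnan choking on non-float values:

```python
import numpy as np
import pandas as pd

ser = pd.Series([[1, 2, 3, np.nan, np.nan], [3, 4, 5], [3, 9], [np.nan, 10]], name="col")

# True for rows whose list contains no NaN (v != v is True only for NaN)
mask = np.array([not any(v != v for v in lst) for lst in ser])
ser[mask]
```

As always with performance questions, time both approaches on a representative slice of the real data before committing to one.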