Search code examples
pandasdataframeslicemissing-data

slicing pandas columns individually between first and last valid index


Say I have a dataframe df with P columns where there can be missing data at different rows, e.g. first row data available for column 1 but for for 2, and potentially vice-versa for other rows. I want to select, for each column separately, the column data between first and last valid index (which, again, can differ across columns), and check if there are NaNs left. Then, I would like to exclude those columns that meet such condition.

My code works this way:

for i in df.columns:
    df_i = df[i]
    trimemd_i = df_i.loc[df_i.first_valid_index():df_i.last_valid_index()]
    if np.any(trimmed_i.isnull()):
        continue
    else:
        good_data.append(i)
df = df.loc[:, good_data]

The problem is that the loop is slow if I have many columns. Is there a more efficient way to do it, maybe avoiding loops?


Solution

  • Assuming such an input, in which you want to remove "C":

         A    B    C
    0  NaN  NaN  NaN
    1  1.0  1.0  1.0
    2  2.0  2.0  NaN
    3  3.0  3.0  3.0
    4  NaN  4.0  NaN
    

    You can remove the columns with inner NaNs using isna a double cummax to identify the inner NaNs, then boolean indexing with all to select the columns that have no inner NaNs:

    tmp = df.notna()
    
    out = df.loc[:, tmp.eq(tmp.cummax() & tmp[::-1].cummax()).all()]
    

    Output:

         A    B
    0  NaN  NaN
    1  1.0  1.0
    2  2.0  2.0
    3  3.0  3.0
    4  NaN  4.0
    

    Intermediates:

    # getting NaNs
    # tmp
           A      B      C
    0  False  False  False
    1   True   True   True
    2   True   True  False
    3   True   True   True
    4  False   True  False
    
    # removing inner NaNs
    # tmp.cummax() & tmp[::-1].cummax()
           A      B      C
    0  False  False  False
    1   True   True   True
    2   True   True   True
    3   True   True   True
    4  False   True  False
    
    # identifying inner NaNs (as False)
    # tmp.eq(tmp.cummax() & tmp[::-1].cummax())
          A     B      C
    0  True  True   True
    1  True  True   True
    2  True  True  False
    3  True  True   True
    4  True  True   True
    
    # columns to keep (no inner NaNs)
    # tmp.eq(tmp.cummax() & tmp[::-1].cummax()).all()
    A     True
    B     True
    C    False
    dtype: bool