Search code examples
pythonpandasdataframenumpylabel

Assign labels from columns in dataframe


I have the following problem, from a 5 column dataframe I need to assign a label to it but only if any of the columns has more than 2 non nan values, for example for the following dataframe:

Name   Orange  Blue  Red  Black
A       nan     nan   ok   nan
A       nan     nan  nan   nan
A       nan     ok    ok   nan
A        ok     nan   ok    ok

In this case A would have to be label as Red, I tried with the elimination of the columns with only nan values:

df[df[~df.isnull().all()]]

but i can't assign the label properly, any idea to optimize the process? the output can be a df with the column with more non nan vales like:

Name   Red  
A      ok   
A      nan   
A      ok   
A      ok   

Solution

  • Here is one way to do it by using notna and boolean indexing :

    N = 2 # <- adjust here the threshold
    
    sl = df.set_index("Name").notna().sum().loc[lambda s: s.gt(N)]
    ​
    out = df[["Name"]].join(df.iloc[:, sl])
    

    Another variant (better one) :

    out = df.loc[:, df.notna().sum() >= N]
    

    Output :

    print(out)
    
      Name  Red
    0    A   ok
    1    A  NaN
    2    A   ok
    3    A   ok