Search code examples
pythonpython-3.xpython-2.7pandassklearn-pandas

pandas return index of rows having more than one 'NA' value


my code:

import pandas as pd
from sklearn.preprocessing import LabelEncoder
column_names = ["age","workclass","fnlwgt","education","education-num","marital-status","occupation","relationship","race","sex","capital-gain","capital-loss","hrs-per-week","native-country","income"]

adult_train = pd.read_csv("adult.data",header=None,sep=',\s',na_values=["?"])
adult_train.columns=column_names
adult_train.fillna('NA',inplace=True)

I want the index of the rows which have the value 'NA' in more than one column. Is there an inbuilt method or I have to iterate row wise and check values at each column? here is the snapshot of the data:data

I want index of rows like 398,409(missing values at column B and G) and not of rows like 394(missing value only at column N)


Solution

  • Use isnull.any(1) or sum to get the boolean mask, then select the rows to get the index i.e

    df = pd.DataFrame({'A':[1,2,3,4,5],
                   'B' :[np.nan,4,5,np.nan,8],
                   'C' :[2,4,np.nan,3,5],
                   'D' :[np.nan,np.nan,np.nan,np.nan,5]})
    
       A    B    C    D
    0  1  NaN  2.0  NaN
    1  2  4.0  4.0  NaN
    2  3  5.0  NaN  NaN
    3  4  NaN  3.0  NaN
    4  5  8.0  5.0  5.0
    
    # If you want to select rows with nan value from Columns B and C 
    df.loc[df[['B','C']].isnull().any(1)].index
    Int64Index([0, 2, 3], dtype='int64')
    
    # If you want to rows with more than one nan then
    df.loc[df.isnull().sum(1)>1].index
    Int64Index([0, 2, 3], dtype='int64')