my code:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
column_names = ["age","workclass","fnlwgt","education","education-num","marital-status","occupation","relationship","race","sex","capital-gain","capital-loss","hrs-per-week","native-country","income"]
adult_train = pd.read_csv("adult.data", header=None, sep=r',\s*', engine='python', na_values=["?"])
adult_train.columns=column_names
adult_train.fillna('NA',inplace=True)
I want the indices of the rows that have the value 'NA' in more than one column. Is there a built-in method, or do I have to iterate row-wise and check the value in each column?
Here is a snapshot of the data:
I want the index of rows like 398 and 409 (missing values in columns B and G), not of rows like 394 (missing value only in column N).
Use isnull().any(axis=1)
or isnull().sum(axis=1)
to get a boolean mask, then select the rows to get the index, i.e.
import numpy as np
df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': [np.nan, 4, 5, np.nan, 8],
                   'C': [2, 4, np.nan, 3, 5],
                   'D': [np.nan, np.nan, np.nan, np.nan, 5]})
A B C D
0 1 NaN 2.0 NaN
1 2 4.0 4.0 NaN
2 3 5.0 NaN NaN
3 4 NaN 3.0 NaN
4 5 8.0 5.0 5.0
# If you want rows with NaN values in columns B or C
df.loc[df[['B','C']].isnull().any(axis=1)].index
Int64Index([0, 2, 3], dtype='int64')
# If you want rows with more than one NaN
df.loc[df.isnull().sum(axis=1) > 1].index
Int64Index([0, 2, 3], dtype='int64')
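One caveat for your setup: fillna('NA') replaces the NaN values with the literal string 'NA', so isnull() will no longer detect them. Either build the mask before the fill, or compare against the sentinel string afterwards. A minimal sketch (using a made-up three-column stand-in for adult_train, since I don't have your file):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for adult_train right after read_csv:
# na_values=["?"] has turned the '?' cells into NaN, fillna not yet applied.
adult_train = pd.DataFrame({
    'workclass': ['Private', np.nan, 'State-gov', np.nan],
    'occupation': ['Sales', np.nan, 'Tech-support', 'Craft-repair'],
    'native-country': ['United-States', np.nan, np.nan, 'Mexico'],
})

# Option 1: compute the index before fillna('NA') runs.
idx = adult_train.index[adult_train.isnull().sum(axis=1) > 1]
print(list(idx))  # only row 1 has more than one NaN -> [1]

# Option 2: if fillna('NA') has already run, compare against the string.
adult_filled = adult_train.fillna('NA')
idx2 = adult_filled.index[(adult_filled == 'NA').sum(axis=1) > 1]
print(list(idx2))  # -> [1]
```

Both options give the same index; the only difference is whether the missing cells are still NaN or have already been replaced by the string sentinel.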