I have a very large pandas DataFrame (>100 million rows and thousands of columns). Each row has a unique label as its index, and for most rows only one column contains a value. I want to make a new DataFrame by deleting the rows in which only one column has a value and keeping the rows in which two or more columns have values.
You can drop them using dropna:
In [3]:
# sample df
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [0, np.nan, 2, 3, 4], 'b': [0, np.nan, 2, 3, np.nan], 'c': np.arange(5)})
df
Out[3]:
     a    b  c
0    0    0  0
1  NaN  NaN  1
2    2    2  2
3    3    3  3
4    4  NaN  4
In [5]:
# keep only the rows that have at least 2 non-NaN values (this drops row 1 here)
df.dropna(thresh=2, axis=0)
Out[5]:
   a    b  c
0  0    0  0
2  2    2  2
3  3    3  3
4  4  NaN  4
You pass thresh=2 to specify that a row must have at least 2 non-NA values to be kept, and axis=0 to specify that the criterion is applied row-wise.
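As a cross-check, the same filter can be sketched with an explicit boolean mask, which makes the "at least 2 non-NA values per row" rule visible. This is only an illustration on the small sample frame; the names non_na_per_row and kept are made up for the example, and it assumes a pandas version that provides DataFrame.notna.

```python
import numpy as np
import pandas as pd

# Same sample frame as in the answer
df = pd.DataFrame({
    "a": [0, np.nan, 2, 3, 4],
    "b": [0, np.nan, 2, 3, np.nan],
    "c": np.arange(5),
})

# Count the non-NA cells in each row
non_na_per_row = df.notna().sum(axis=1)

# Keep only rows with at least 2 non-NA cells
kept = df[non_na_per_row >= 2]

# The mask-based result should match dropna(thresh=2)
assert kept.equals(df.dropna(thresh=2))
print(kept.index.tolist())  # [0, 2, 3, 4]
```

For the real data, dropna(thresh=2) is likely the simpler choice; the mask form is mainly useful if you also want to inspect the per-row counts before filtering.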