I'm trying to clean test data from Kaggle's Titanic dataset, specifically the columns - sex, fare, pclass, and age. In order to do this, I'd like to find out if any of these columns have empty values. I load the data:
import pandas as pd
test_address = 'path_to_data\test.csv'
test = pd.read_csv(test_address)
When I try to look for empty values in the columns,
True in test['Sex'].isna()
outputs True.
However,
test['Sex'].isna().value_counts()
outputs
False 418
Name: Sex, dtype: int64
which should mean there aren't any empty values (I could confirm this by visually scanning the csv file). These commands on test['Pclass']
have similar outputs.
Why is the 'True in' command giving me the wrong answer?
The operator in
, when applied to a Series, checks if its left operand is in the index of the right operand. Since there is a row #1 (the numeric representation of True
) in the series, the operator evaluates to True.
For the same reason False in df['Sex'].isna()
is True
, but False in df['Sex'][1:].isna()
is False
(there is no row #0 in the latter slice).
You should check if True in df['Sex'].isna().values
.