Search code examples
pythonpandascsvdataframekaggle

Searching a Pandas DataFrame column for empty values gives contradictory results


I'm trying to clean test data from Kaggle's Titanic dataset, specifically the columns - sex, fare, pclass, and age. In order to do this, I'd like to find out if any of these columns have empty values. I load the data:

import pandas as pd
test_address = 'path_to_data\test.csv'
test = pd.read_csv(test_address)

When I try to look for empty values in the columns,

True in test['Sex'].isna()

outputs True.

However,

test['Sex'].isna().value_counts()

outputs

False    418
Name: Sex, dtype: int64

which should mean there aren't any empty values (I could confirm this by visually scanning the csv file). These commands on test['Pclass'] have similar outputs.

Why is the 'True in' command giving me the wrong answer?


Solution

  • The operator in, when applied to a Series, checks if its left operand is in the index of the right operand. Since there is a row #1 (the numeric representation of True) in the series, the operator evaluates to True.

    For the same reason False in df['Sex'].isna() is True, but False in df['Sex'][1:].isna() is False (there is no row #0 in the latter slice).

    You should check if True in df['Sex'].isna().values.