Search code examples
pythonpandasdata-analysis

How to remove all rows of a datframe column that contain a question mark instead of occupation


This is my attempt:

df['occupation']= df['occupation'].str.replace('?',  '')

df.dropna(subset=['occupation'], inplace=True)

but it is not working, How do i remove all of the rows of the occupation column that i read from a csv file that contain a ? rather than an occupation


Solution

  • Clean up the white space and use an 'unselect' filter:

    import pandas as pd
    bugs = ['grasshopper','cricket','ant','spider']
    fruit = ['lemon','komquat','watermelon','apple']
    squashed = ['  ?   ','Yes','No','Eww']
    
    df = pd.DataFrame(list(zip(bugs,fruit,squashed)), columns = ['Bugs','Fruit','Squashed'])
    print(df.head())
    

    enter image description here

    df = df[df['Squashed'].apply(lambda x: x.strip()) != '?']
    print('after stripping white space and after unselect')
    print(df.head())
    

    enter image description here

    Why

    The dataframe method .dropna() won't detect blanks (ie '') but will look for Nan or NaT or None.

    https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html

    However, using .replace() to set the value to missing won't work because .replace() requires the type to match and None doesn't match any type you'll have in a column already.

    Better to clean up the white space (which is the simple case) using lambda on each entry to apply the string transformation.