Search code examples
pythonpandasdataframedata-cleaning

Remove row in Dataframe if Contains Question Mark Python 3


I am newbie here so please forgive for any mistakes. I am trying to work on adult census dataset. I am finding it hard to remove the question marks in the dataset.

Link to the dataset :- https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data

I have also tried the 1st answer in the given link:- Drop rows with a 'question mark' value in any column in a pandas dataframe

But I am getting an error

~/anaconda3/lib/python3.6/site-packages/pandas/core/ops.py in wrapper(self, other, axis)
   1251 
   1252             with np.errstate(all='ignore'):
-> 1253                 res = na_op(values, other)
   1254             if is_scalar(res):
   1255                 raise TypeError('Could not compare {typ} type with Series'

~/anaconda3/lib/python3.6/site-packages/pandas/core/ops.py in na_op(x, y)
   1164                     result = method(y)
   1165                 if result is NotImplemented:
-> 1166                     raise TypeError("invalid type comparison")
   1167             else:
   1168                 result = op(x, y)

TypeError: invalid type comparison

Please tell me how to solve this issue. I am using Python 3.6

Thank You!!

Edit 1:- This is also called Census Income Dataset.


Solution

  • First cast to strings and then filter by boolean indexing:

    df = df[(df.astype(str) != '?').all(axis=1)]
    #alternative solution
    #df = df[~(df.astype(str) == '?').any(axis=1)]
    print (df)
       X  Y  Z
    1  1  2  3
    3  4  4  4
    

    Or compare numpy array:

    df = df[(df.values != '?').all(axis=1)]
    

    Details:

    Compare all converted strings by astype with change condition to !=:

    print (df.astype(str) != '?')
           X      Y      Z
    0   True   True  False
    1   True   True   True
    2  False  False   True
    3   True   True   True
    4  False   True   True
    

    And then check if all True values per row:

    print ((df.astype(str) != '?').all(axis=1))
    0    False
    1     True
    2    False
    3     True
    4    False
    dtype: bool