Search code examples
pythonpandaspython-applymap

replace columns with NaN based on a string in pandas


I have the below data frame

ipdb> csv_data
  country_edited sale_edited  date_edited  transformation_edited
0          India      403171     20090101                     10
1         Bhutan      394096     20090101                     20
2          Nepal    Set Null     20090101                     30
3         madhya      355883     20090101                     40
4          sudan    Set Null     20090101                     50

I want to replace all the column values that contain Set Null to Nan and so i approached below way

import numpy

def set_NaN(element):
    if element == 'Set Null':
        return numpy.nan
    else:
        return element

csv_data = csv_data.applymap(lambda element: set_NaN(element))

But it does not changes anything

ipdb> print csv_data
  country_edited sale_edited  date_edited  transformation_edited
0          India      403171     20090101                     10
1         Bhutan      394096     20090101                     20
2          Nepal    Set Null     20090101                     30
3         madhya      355883     20090101                     40
4          sudan    Set Null     20090101                     50
ipdb>

But when i print only csv_data.applymap(lambda element: set_NaN(element)) as below i can see the output, but when assigned back i can't get the data i intended to

ipdb> csv_data.applymap(lambda element: set_NaN(element))
  country_edited sale_edited  date_edited  transformation_edited
0          India      403171     20090101                     10
1         Bhutan      394096     20090101                     20
2          Nepal         NaN     20090101                     30
3         madhya      355883     20090101                     40
4          sudan         NaN     20090101                     50

So how to replace the column values with NaN based on certain string ?


Solution

  • You need DataFrame.mask, it replace True values of mask by NaN. Also some columns are numeric, so need cast values of df to string first:

    print (csv_data.astype(str) == 'Set Null')
      country_edited sale_edited date_edited transformation_edited
    0          False       False       False                 False
    1          False       False       False                 False
    2          False        True       False                 False
    3          False       False       False                 False
    4          False        True       False                 False
    
    
    csv_data = csv_data.mask(csv_data.astype(str) == 'Set Null')
    print (csv_data)
      country_edited sale_edited  date_edited  transformation_edited
    0          India      403171     20090101                     10
    1         Bhutan      394096     20090101                     20
    2          Nepal         NaN     20090101                     30
    3         madhya      355883     20090101                     40
    4          sudan         NaN     20090101                     50
    

    Another solution with numpy boolean mask - compare numpy array by DataFrame.values:

    print (csv_data.values == 'Set Null')
    [[False False False False]
     [False False False False]
     [False  True False False]
     [False False False False]
     [False  True False False]]
    
    csv_data = csv_data.mask(csv_data.values == 'Set Null')
    print (csv_data)
      country_edited sale_edited  date_edited  transformation_edited
    0          India      403171     20090101                     10
    1         Bhutan      394096     20090101                     20
    2          Nepal         NaN     20090101                     30
    3         madhya      355883     20090101                     40
    4          sudan         NaN     20090101                     50
    

    In your solution is necessary assign data back to csv_data:

    def set_NaN(element):
        if element == 'Set Null':
            return numpy.nan
        else:
            return element
    
    csv_data = csv_data.applymap(lambda element: set_NaN(element))
    print (csv_data)
      country_edited sale_edited  date_edited  transformation_edited
    0          India      403171     20090101                     10
    1         Bhutan      394096     20090101                     20
    2          Nepal         NaN     20090101                     30
    3         madhya      355883     20090101                     40
    4          sudan         NaN     20090101                     50