Search code examples
pythonexcelpandasdataframemissing-data

pandas.read_excel() na_values not working correctly


As title states, after reviewing docs

I am reading an .xlsx file, with a column 'HOUR' which has many values, when an instance has value 99, i want to convert to None

I have tried the na_values param with different values:

na_values = ['99'] 
na_values = [r'99']
na_values = 99
...

To then read the excel like this:

accidents = pd.read_excel(filename, sheet_name= 'datos', na_values=[99])

but it doesn't seem to work, when i do:

np.sum(accidents['HOUR'] == 99)

I get a value > than 0 (which means the instances with value = 99 have not been transformed to None/NaN)

I have also read that i should include the option

keep_default_na=False

but to no avail.

The values of the 'HOUR' column are:

accidents['HOUR'].unique()
array([ 8, 15,  9, 14, 11,  0, 13, 20,  3, 19, 17,  7, 22, 21, 16,  6, 23,
   18, 10, 12,  1, 99,  4,  5,  2, 24], dtype=int64)

I have updated my pandas version to 1.5.1 and it still doesn't work, any ideas why?

.xls file can be found in: http://www.transtats.bts.gov/Fields.asp?Table_ID=1158

Thank you


Solution

  • Just apply replace method on the dataframe after reading the excel file:

    df.replace(99, np.nan)
    

    If you want to replace values for only specific column like Hour:

    df['HOUR'].replace(99, np.nan)
    

    Update:

    I think you want to know why read_excel() method isn't working with the na values you provided, if you check the documentation for the method:

    na_values : scalar, str, list-like, or dict, default None Additional strings to recognize as NA/NaN.

    It only accept strings in the na_values paramater, so you need to pass it as string '99' in order to work in your case.

    For some reasons it's not working for integer na_values in excel sheets. according to docs read_excel method column types are inferred but can be explicitly specified. maybe it doesn't set HOUR col type correctly so the na_values not working. I found a work around by specifying column data type in the method explicitly and it worked perfectly:

    accidents = pd.read_excel(filename, sheet_name= 'datos', dtype={'HOUR': str}, na_values=['99'])
    
    accidents = pd.read_excel(filename, sheet_name= 'datos', dtype={'HOUR': str}, na_values=[99])