Search code examples
pythonpython-3.xpandasdataframereplace

Strange behaviour with str.replace() in Pandas dataframe. Deleting values that don't match target string


I have a dataframe of weather data that I am reading in from a csv file and two of the columns 'Sea_Level_Pressure' and 'Wind_Speed' have numeric values that are suffixed with an 's' I would like to remove. However when I use:

df['Sea_Level_Pressure'] = df['Sea_Level_Pressure'].str.replace('s','')
df['Wind_Speed'] = df['Wind_Speed'].str.replace('s','')

The result is that for the first half of the rows the 'Sea_Level_Pressure' values are replaced with null and at the same row in the dataframe in the last half of the rows the values for 'Wind_Speed' are replaced with null. The data type for both columns is object.

Here is sample code that will download the csv from NOAA and print a csv before and after applying str.replace. The break in null values for the two columns happens at 2020-09-09 16:52 as you can see in the second csv file that is output.

import pandas as pd

url = 'https://www.ncei.noaa.gov/data/local-climatological-data/access/2020/72530594892.csv'
df = pd.read_csv(url)
df = df[df.REPORT_TYPE == 'FM-15']
df = df[['DATE', 'HourlyDryBulbTemperature','HourlyRelativeHumidity','HourlySeaLevelPressure','HourlyWindSpeed','HourlyPrecipitation']]
df.rename(columns={'HourlyDryBulbTemperature': 'Temp_F', 'HourlyRelativeHumidity':'Rel_Humidity', 'HourlySeaLevelPressure':'Sea_Level_Pressure','HourlyWindSpeed':'Wind_Speed','HourlyPrecipitation':'Precip'}, inplace=True)

df.to_csv('weather_bf_replace.csv', index=False)

df['Sea_Level_Pressure'] = df['Sea_Level_Pressure'].str.replace('s','')
df['Wind_Speed'] = df['Wind_Speed'].str.replace('s','')

df.to_csv('weather_after_replace.csv',index=False)

Interestingly, if I save the df to a temp csv prior to doing the str.replace and then read the temp csv back into a df and apply str.replace to that dataframe it works fine. I tried adding the str.replace to the original dataframe right after reading the csv and I get the same behaviour so the few lines of filtering and renameing the columns is not causing the issue. I also examined the original csv file around the datetime where the break occurs and there is nothing unusual in the data.

Thanks in advance for the help. I am at my wits end with this.


Solution

  • The reason is that your columns contain mixed types, probably because of how the data is stored in the original CSV.

    print(set([type(x) for x in df['Sea_Level_Pressure']]))
    

    will result in {float, str}

    However, str.replace is a method that works on the str type. To fix this, convert your column into str first and then back to float:

    df = df['Sea_Level_Pressure'].astype(str).str.replace('s','').astype(float)
    

    Alternatively, specify the dtypes of the columns explicitly during import.