How to create a new dataframe that copies all data from another, with all values beginning with -9 converted to NaN?


I have a dataframe of 10899 rows × 32 columns in which many cells contain values that start with -9, e.g. -99.0, -9, -9.678.

How can I create a new dataframe from the original where all values beginning with '-9' are converted to NaN?

The code I used returned a new dataframe with no changes.

Let's call the original dataframe weatherData_original.

Here's a piece of weatherData_original:

        Date        Solar   MaxRH   AvgAirTemp
0       3/1/1983    -9.00   -9.0    -99.00
1       3/2/1983    -9.00   -9.0    0.31
2       3/3/1983    -9.00   -9.0    -99.00
3       3/4/1983    -9.00   -9.0    8.62
4       3/5/1983    19.97   64.6    8.91
...     ...         ...     ...     ...
10894   12/27/2012  9.67    53.9    5.99
10895   12/28/2012  10.21   89.7    0.96
10896   12/29/2012  10.25   57.9    1.89
10897   12/30/2012  3.72    86.3    3.85
10898   12/31/2012  9.68    92.9    3.53

Code:

new_df = weatherData_original.replace(regex=r'^-9\d+', value=pd.NA) 

new_df = weatherData_original.replace(regex='^[-9].*', value=pd.NA)

Solution

  • I think the issue here is data types. A regular expression only matches string values, but pandas has most likely read the columns you're interested in as floats or integers, so replace(regex=...) finds nothing to match and returns the data unchanged. You can check this by running:

    weatherData_original.dtypes
    

    One way around this is to set the dtype to str when loading the data. This will read in every column as a string value.

    import pandas as pd
    # Small stand-in for the real data, built with dtype=str
    data = {'Date': ['3/1/1983', '3/2/1983'], 'Solar': [-9.00, -9.00]}
    df = pd.DataFrame(data, dtype=str)
    

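    Then you can run your regular expression replacements. Your second pattern does replace the -9 values, but note that [-9] is a character class matching either '-' or '9', so it also replaces every other negative value and anything that starts with 9 (9.67, or a date such as 9/1/1983). Your first pattern requires a digit right after -9, so it catches -99.00 but misses -9.00. Anchoring on the literal prefix '-9' avoids both problems:

    new_df = df.replace(regex=r'^-9.*', value=pd.NA)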
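
If you would rather keep the numeric columns as floats instead of turning everything into strings, here is a minimal sketch of an alternative (not the approach described above): build a boolean mask from the text form of each number and apply it with DataFrame.mask. The sample rows are a hypothetical stand-in for weatherData_original, and the names numeric_cols and starts_with_minus_9 are made up for illustration.

```python
import pandas as pd

# Hypothetical sample mirroring a few rows of weatherData_original.
weatherData_original = pd.DataFrame({
    "Date": ["3/1/1983", "3/2/1983", "3/5/1983", "12/27/2012"],
    "Solar": [-9.00, -9.00, 19.97, 9.67],
    "MaxRH": [-9.0, -9.0, 64.6, 53.9],
    "AvgAirTemp": [-99.00, 0.31, 8.91, 5.99],
})

# Only numeric columns are checked; the Date column is left untouched.
numeric_cols = weatherData_original.select_dtypes(include="number").columns

# View each number as text and flag cells whose text starts with "-9".
starts_with_minus_9 = weatherData_original[numeric_cols].apply(
    lambda col: col.astype(str).str.startswith("-9")
)

# Copy the original, then set flagged cells to NaN; unflagged values are kept.
new_df = weatherData_original.copy()
new_df[numeric_cols] = new_df[numeric_cols].mask(starts_with_minus_9)

print(new_df)
```

Because the original is copied first, weatherData_original is not modified, and the numeric columns in new_df remain usable for arithmetic without a second conversion step.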