Search code examples
pythonpandasdataframenumpydata-analysis

How to replace missing data in a column with the daily average based on dates


I have this dataframe:

            DATE  PRCP
0     1949-01-01    25
1     1949-01-02     5
2     1949-01-03     0
3     1949-01-04     0
4     1949-01-05     0
...          ...   ...
20952 2016-01-27     0
20953 2016-01-28     0
20954 2016-01-29    76
20955 2016-01-30     3
20956 2016-01-31     0

Values that are missing have the value -9999.

I need to calculate the sum of PRCP by year. To do that i need to replace the missing data (-9999) in each row with the daily average (the PRCP average for each day of the year from 1949 to 2016 - 366 values)

I created a new Series to calculate the daily mean that looks like this:

DATE
01-01    32.481481
01-02    37.703704
01-03    21.557692
01-04    30.230769
01-05    30.250000
           ...    
12-27    22.980769
12-28    54.076923
12-29    42.685185
12-30    25.553571
12-31    15.000000
Name: PRCP, Length: 366, dtype: float64

Based on this code:

daily_mean = df[df['PRCP'] != -9999].groupby(df['DATE'].dt.strftime('%m-%d'))['PRCP'].mean()

All i need is a way to replace every -9999 in the PRCP column in the original dataframe with the matching value in the daily mean based on the date of the row, so I can get a more accurate total PRCP for each year. Any ideas on how to do that in an efficient and clean way?

To try and solve this I wrote a basic for loop to replace the values:

for i, row in df.iterrows():
    if df.loc[i, 'PRCP'] == -9999:
        date_str = f"{df.loc[i, 'DATE'].month}-{df.loc[i, 'DATE'].day}"
        df.loc[i, 'PRCP'] = daily_mean[date_str]

And then i calculated the yearly sum like this:

prcp_sum_year = df.groupby(df['DATE'].dt.year)['PRCP'].sum()

But I don't want to iterate over the dataframe in a for loop as I'm sure there is a better way.


Solution

  • Use:

    print (df)
                 DATE  PRCP
    0      1949-01-01    25
    1      1949-01-02     5
    2      1949-01-03 -9999
    3      1949-01-04     0
    4      1949-01-05     0
    20952  2016-01-27     0
    20953  2016-01-28     0
    20954  2016-01-03    76
    20955  2017-01-03     3
    20956  2016-01-31     0
    
    #convert values to datetimes
    df['DATE'] = pd.to_datetime(df['DATE'])
    
    #test values for replace
    m = df['PRCP'].eq(-9999)
    #convert to MM-DD Series
    s  = df['DATE'].dt.strftime('%m-%d')
    
    #for replaced values map mean per groups
    df.loc[m, 'PRCP'] = s[m].map(df[~m].groupby(s)['PRCP'].mean())
    

    print (df)
                DATE  PRCP
    0     1949-01-01  25.0
    1     1949-01-02   5.0
    2     1949-01-03  39.5
    3     1949-01-04   0.0
    4     1949-01-05   0.0
    20952 2016-01-27   0.0
    20953 2016-01-28   0.0
    20954 2016-01-03  76.0
    20955 2017-01-03   3.0
    20956 2016-01-31   0.0