I have this dataframe:
DATE PRCP
0 1949-01-01 25
1 1949-01-02 5
2 1949-01-03 0
3 1949-01-04 0
4 1949-01-05 0
... ... ...
20952 2016-01-27 0
20953 2016-01-28 0
20954 2016-01-29 76
20955 2016-01-30 3
20956 2016-01-31 0
Values that are missing have the value -9999.
I need to calculate the sum of PRCP by year. To do that i need to replace the missing data (-9999) in each row with the daily average (the PRCP average for each day of the year from 1949 to 2016 - 366 values)
I created a new Series to calculate the daily mean that looks like this:
DATE
01-01 32.481481
01-02 37.703704
01-03 21.557692
01-04 30.230769
01-05 30.250000
...
12-27 22.980769
12-28 54.076923
12-29 42.685185
12-30 25.553571
12-31 15.000000
Name: PRCP, Length: 366, dtype: float64
Based on this code:
daily_mean = df[df['PRCP'] != -9999].groupby(df['DATE'].dt.strftime('%m-%d'))['PRCP'].mean()
All i need is a way to replace every -9999 in the PRCP column in the original dataframe with the matching value in the daily mean based on the date of the row, so I can get a more accurate total PRCP for each year. Any ideas on how to do that in an efficient and clean way?
To try and solve this I wrote a basic for loop to replace the values:
for i, row in df.iterrows():
if df.loc[i, 'PRCP'] == -9999:
date_str = f"{df.loc[i, 'DATE'].month}-{df.loc[i, 'DATE'].day}"
df.loc[i, 'PRCP'] = daily_mean[date_str]
And then i calculated the yearly sum like this:
prcp_sum_year = df.groupby(df['DATE'].dt.year)['PRCP'].sum()
But I don't want to iterate over the dataframe in a for loop as I'm sure there is a better way.
Use:
print (df)
DATE PRCP
0 1949-01-01 25
1 1949-01-02 5
2 1949-01-03 -9999
3 1949-01-04 0
4 1949-01-05 0
20952 2016-01-27 0
20953 2016-01-28 0
20954 2016-01-03 76
20955 2017-01-03 3
20956 2016-01-31 0
#convert values to datetimes
df['DATE'] = pd.to_datetime(df['DATE'])
#test values for replace
m = df['PRCP'].eq(-9999)
#convert to MM-DD Series
s = df['DATE'].dt.strftime('%m-%d')
#for replaced values map mean per groups
df.loc[m, 'PRCP'] = s[m].map(df[~m].groupby(s)['PRCP'].mean())
print (df)
DATE PRCP
0 1949-01-01 25.0
1 1949-01-02 5.0
2 1949-01-03 39.5
3 1949-01-04 0.0
4 1949-01-05 0.0
20952 2016-01-27 0.0
20953 2016-01-28 0.0
20954 2016-01-03 76.0
20955 2017-01-03 3.0
20956 2016-01-31 0.0