With the following data:
idx_a = pd.date_range(start="2000-01-01 00:00:00", periods=5, freq="H")
idx_b = pd.date_range(start="2000-01-01 00:05:00", periods=5, freq="H")
idx_c = pd.date_range(start="2000-01-02 00:00:00", periods=5, freq="H")
idx_d = pd.date_range(start="2000-01-02 00:05:00", periods=5, freq="H")
df = pd.DataFrame({'article': ['a', 'b']*10, 'view': range(1, 21)}, index=idx_a.union(idx_b).union(idx_c).union(idx_d))
article view
2000-01-01 00:00:00 a 1
2000-01-01 00:05:00 b 2
2000-01-01 01:00:00 a 3
2000-01-01 01:05:00 b 4
2000-01-01 02:00:00 a 5
2000-01-01 02:05:00 b 6
2000-01-01 03:00:00 a 7
2000-01-01 03:05:00 b 8
2000-01-01 04:00:00 a 9
2000-01-01 04:05:00 b 10
2000-01-02 00:00:00 a 11
2000-01-02 00:05:00 b 12
2000-01-02 01:00:00 a 13
2000-01-02 01:05:00 b 14
2000-01-02 02:00:00 a 15
2000-01-02 02:05:00 b 16
2000-01-02 03:00:00 a 17
2000-01-02 03:05:00 b 18
2000-01-02 04:00:00 a 19
2000-01-02 04:05:00 b 20
I'd like to update only the 2am view value with the 3am view value for each article on each day. So the desired result should look like this ("<===" marks the updated rows):
article view
2000-01-01 00:00:00 a 1
2000-01-01 00:05:00 b 2
2000-01-01 01:00:00 a 3
2000-01-01 01:05:00 b 4
2000-01-01 02:00:00 a 7 <===
2000-01-01 02:05:00 b 8 <===
2000-01-01 03:00:00 a 7
2000-01-01 03:05:00 b 8
2000-01-01 04:00:00 a 9
2000-01-01 04:05:00 b 10
2000-01-02 00:00:00 a 11
2000-01-02 00:05:00 b 12
2000-01-02 01:00:00 a 13
2000-01-02 01:05:00 b 14
2000-01-02 02:00:00 a 17 <===
2000-01-02 02:05:00 b 18 <===
2000-01-02 03:00:00 a 17
2000-01-02 03:05:00 b 18
2000-01-02 04:00:00 a 19
2000-01-02 04:05:00 b 20
After numerous attempts, I got the closest result with this code:
df.groupby([pd.Grouper(freq="D"), 'article']).view.transform(lambda s: s.where( ~(s.index.hour==2), s[s.index.hour==3]))
However, np.nan shows up in the cells that I expected to be updated. Interestingly, if I replace s[s.index.hour==3] with an integer, the cells (i.e., the 2am values) are correctly updated with that integer. How can I get the 3am value of each article and use it to update the 2am value on a given day?
If the only difference between the source and target rows is the hour, you can select the 3am rows, relabel their index to 2am with rename, and then use DataFrame.update:
df1 = df[df.index.hour==3].rename(lambda x: x.replace(hour=2))
print (df1)
article view
2000-01-01 02:00:00 a 7
2000-01-01 02:05:00 b 8
2000-01-02 02:00:00 a 17
2000-01-02 02:05:00 b 18
df.update(df1)
print (df)
article view
2000-01-01 00:00:00 a 1.0
2000-01-01 00:05:00 b 2.0
2000-01-01 01:00:00 a 3.0
2000-01-01 01:05:00 b 4.0
2000-01-01 02:00:00 a 7.0
2000-01-01 02:05:00 b 8.0
2000-01-01 03:00:00 a 7.0
2000-01-01 03:05:00 b 8.0
2000-01-01 04:00:00 a 9.0
2000-01-01 04:05:00 b 10.0
2000-01-02 00:00:00 a 11.0
2000-01-02 00:05:00 b 12.0
2000-01-02 01:00:00 a 13.0
2000-01-02 01:05:00 b 14.0
2000-01-02 02:00:00 a 17.0
2000-01-02 02:05:00 b 18.0
2000-01-02 03:00:00 a 17.0
2000-01-02 03:05:00 b 18.0
2000-01-02 04:00:00 a 19.0
2000-01-02 04:05:00 b 20.0
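In the output above, view is printed as floats after the update. If keeping the integer dtype matters, one possible alternative is to assign the 3am values directly with .loc. This is only a sketch under assumptions: it presumes the index is unique and that every 3am row has a matching row exactly one hour earlier, as in the sample data. The names src and target are made up for illustration, and freq is given as a Timedelta rather than "H" purely to sidestep alias differences between pandas versions.

```python
import pandas as pd

# Rebuild the sample frame from the question (hourly steps via a Timedelta).
hour = pd.Timedelta(hours=1)
starts = ["2000-01-01 00:00", "2000-01-01 00:05",
          "2000-01-02 00:00", "2000-01-02 00:05"]
idx = pd.DatetimeIndex([])
for start in starts:
    idx = idx.union(pd.date_range(start=start, periods=5, freq=hour))

df = pd.DataFrame({"article": ["a", "b"] * 10, "view": range(1, 21)}, index=idx)

# Rows holding the 3am values, and the labels of the rows one hour earlier.
src = df[df.index.hour == 3]
target = src.index - hour

# Assign by position: to_numpy() drops the 3am index, so nothing is
# realigned and the column keeps its integer dtype.
df.loc[target, "view"] = src["view"].to_numpy()

print(df)
```

Because the values are written in the same order as the target labels, this relies on src and target lining up one to one; the update approach shown before is more forgiving if some 2am rows are missing.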
Your solution works if you convert the 3am values to a list, so that where does not align them on the index; also, instead of inverting the mask with ~, use !=:
df['view1'] = (df.groupby([pd.Grouper(freq="D"), 'article']).view
.transform(lambda s: s.where(s.index.hour!=2,s[s.index.hour==3].tolist())))
print (df)
article view view1
2000-01-01 00:00:00 a 1 1
2000-01-01 00:05:00 b 2 2
2000-01-01 01:00:00 a 3 3
2000-01-01 01:05:00 b 4 4
2000-01-01 02:00:00 a 5 7
2000-01-01 02:05:00 b 6 8
2000-01-01 03:00:00 a 7 7
2000-01-01 03:05:00 b 8 8
2000-01-01 04:00:00 a 9 9
2000-01-01 04:05:00 b 10 10
2000-01-02 00:00:00 a 11 11
2000-01-02 00:05:00 b 12 12
2000-01-02 01:00:00 a 13 13
2000-01-02 01:05:00 b 14 14
2000-01-02 02:00:00 a 15 17
2000-01-02 02:05:00 b 16 18
2000-01-02 03:00:00 a 17 17
2000-01-02 03:05:00 b 18 18
2000-01-02 04:00:00 a 19 19
2000-01-02 04:05:00 b 20 20
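To see why the original attempt produced np.nan, it may help to look at a single group in isolation. The sketch below uses a small hand-built Series standing in for one (day, article) group; the names is_2am, at_3am, aligned and positional are illustrative only and do not appear in the question.

```python
import pandas as pd

# Stand-in for one (day, article) group as transform would see it.
idx = pd.to_datetime([
    "2000-01-01 00:00", "2000-01-01 01:00", "2000-01-01 02:00",
    "2000-01-01 03:00", "2000-01-01 04:00",
])
s = pd.Series([1, 3, 5, 7, 9], index=idx)

is_2am = s.index.hour == 2
at_3am = s[s.index.hour == 3]   # a one-row Series labelled 03:00

# A Series passed as `other` is aligned on the index first. There is no
# 03:00 value under the 02:00 label, so the replacement there is NaN.
aligned = s.where(~is_2am, at_3am)

# A plain list carries no index, so its single value is used as is.
positional = s.where(~is_2am, at_3am.tolist())

print(aligned)
print(positional)
```

This also matches the observation in the question that a plain integer works: a scalar, like the one-element list, has no index to align on.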