
Pandas dataframe - update the values of certain rows based on the condition of a groupby object


With the following data:

import pandas as pd

# freq="H" means hourly; pandas 2.2+ prefers the lowercase alias "h"
idx_a = pd.date_range(start="2000-01-01 00:00:00", periods=5, freq="H")
idx_b = pd.date_range(start="2000-01-01 00:05:00", periods=5, freq="H")
idx_c = pd.date_range(start="2000-01-02 00:00:00", periods=5, freq="H")
idx_d = pd.date_range(start="2000-01-02 00:05:00", periods=5, freq="H")

df = pd.DataFrame({'article': ['a', 'b']*10, 'view': range(1, 21)}, index=idx_a.union(idx_b).union(idx_c).union(idx_d))
print(df)
                     article    view
2000-01-01 00:00:00        a    1
2000-01-01 00:05:00        b    2
2000-01-01 01:00:00        a    3
2000-01-01 01:05:00        b    4
2000-01-01 02:00:00        a    5
2000-01-01 02:05:00        b    6
2000-01-01 03:00:00        a    7
2000-01-01 03:05:00        b    8
2000-01-01 04:00:00        a    9
2000-01-01 04:05:00        b    10
2000-01-02 00:00:00        a    11
2000-01-02 00:05:00        b    12
2000-01-02 01:00:00        a    13
2000-01-02 01:05:00        b    14
2000-01-02 02:00:00        a    15
2000-01-02 02:05:00        b    16
2000-01-02 03:00:00        a    17
2000-01-02 03:05:00        b    18
2000-01-02 04:00:00        a    19
2000-01-02 04:05:00        b    20

I'd like to replace only the 2 am view value with the 3 am view value, for each article on each day. The desired result should look like this ("<===" marks the updated rows):

                     article    view
2000-01-01 00:00:00        a    1
2000-01-01 00:05:00        b    2
2000-01-01 01:00:00        a    3
2000-01-01 01:05:00        b    4
2000-01-01 02:00:00        a    7   <===
2000-01-01 02:05:00        b    8   <===
2000-01-01 03:00:00        a    7
2000-01-01 03:05:00        b    8
2000-01-01 04:00:00        a    9
2000-01-01 04:05:00        b    10
2000-01-02 00:00:00        a    11
2000-01-02 00:05:00        b    12
2000-01-02 01:00:00        a    13
2000-01-02 01:05:00        b    14
2000-01-02 02:00:00        a    17  <===
2000-01-02 02:05:00        b    18  <===
2000-01-02 03:00:00        a    17
2000-01-02 03:05:00        b    18
2000-01-02 04:00:00        a    19
2000-01-02 04:05:00        b    20

After numerous attempts, I got the closest result with this code:

df.groupby([pd.Grouper(freq="D"), 'article']).view.transform(lambda s: s.where( ~(s.index.hour==2), s[s.index.hour==3]))

However, NaN appears in the cells that I expected to be updated. Interestingly, if I replace s[s.index.hour==3] with an integer, the 2 am cells are correctly updated with that integer. How can I get each article's 3 am value and use it to update the 2 am value for a given day?


Solution

  • If the source and target rows differ only in the hour, you can filter the 3 am rows, relabel their index to hour 2, and then use DataFrame.update:

    df1 = df[df.index.hour==3].rename(lambda x: x.replace(hour=2))
    print (df1)
                        article  view
    2000-01-01 02:00:00       a     7
    2000-01-01 02:05:00       b     8
    2000-01-02 02:00:00       a    17
    2000-01-02 02:05:00       b    18
    
    df.update(df1)
    print (df)
                        article  view
    2000-01-01 00:00:00       a   1.0
    2000-01-01 00:05:00       b   2.0
    2000-01-01 01:00:00       a   3.0
    2000-01-01 01:05:00       b   4.0
    2000-01-01 02:00:00       a   7.0
    2000-01-01 02:05:00       b   8.0
    2000-01-01 03:00:00       a   7.0
    2000-01-01 03:05:00       b   8.0
    2000-01-01 04:00:00       a   9.0
    2000-01-01 04:05:00       b  10.0
    2000-01-02 00:00:00       a  11.0
    2000-01-02 00:05:00       b  12.0
    2000-01-02 01:00:00       a  13.0
    2000-01-02 01:05:00       b  14.0
    2000-01-02 02:00:00       a  17.0
    2000-01-02 02:05:00       b  18.0
    2000-01-02 03:00:00       a  17.0
    2000-01-02 03:05:00       b  18.0
    2000-01-02 04:00:00       a  19.0
    2000-01-02 04:05:00       b  20.0
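
As a variation on the same relabelling idea, here is a minimal sketch that assigns the relabelled 3 am values through .loc instead of DataFrame.update. It rebuilds the question's frame with freq="60min" (an assumption, chosen only to sidestep frequency-alias differences between pandas versions), and the names starts, parts and src are illustrative, not from the original answer.

```python
import pandas as pd

# Rebuild the question's frame; "60min" is used instead of "H" (assumption).
starts = ["2000-01-01 00:00:00", "2000-01-01 00:05:00",
          "2000-01-02 00:00:00", "2000-01-02 00:05:00"]
parts = [pd.date_range(start=s, periods=5, freq="60min") for s in starts]
index = parts[0].union(parts[1]).union(parts[2]).union(parts[3])
df = pd.DataFrame({"article": ["a", "b"] * 10, "view": range(1, 21)}, index=index)

# Take the 3 am views and relabel each timestamp to 2 am of the same day.
src = df.loc[df.index.hour == 3, "view"].rename(lambda ts: ts.replace(hour=2))

# Label-based assignment: each relabelled value lands on its matching 2 am row.
df.loc[src.index, "view"] = src

print(df[df.index.hour == 2])
```

Like the update approach, this relies on every 3 am row having a 2 am row with the same date, minute and second. It also appears to avoid the float upcast visible in the update output above, though that may depend on the pandas version.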
    

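    Your solution works if you convert the 3 am values to a list, so that where does not align them on the index (the 3 am label never matches the 2 am row, which is where the NaN comes from). The inverted mask can also be written with != instead: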

    df['view1'] = (df.groupby([pd.Grouper(freq="D"), 'article']).view
                    .transform(lambda s: s.where(s.index.hour!=2,s[s.index.hour==3].tolist())))
    print (df)
                        article  view  view1
    2000-01-01 00:00:00       a     1      1
    2000-01-01 00:05:00       b     2      2
    2000-01-01 01:00:00       a     3      3
    2000-01-01 01:05:00       b     4      4
    2000-01-01 02:00:00       a     5      7
    2000-01-01 02:05:00       b     6      8
    2000-01-01 03:00:00       a     7      7
    2000-01-01 03:05:00       b     8      8
    2000-01-01 04:00:00       a     9      9
    2000-01-01 04:05:00       b    10     10
    2000-01-02 00:00:00       a    11     11
    2000-01-02 00:05:00       b    12     12
    2000-01-02 01:00:00       a    13     13
    2000-01-02 01:05:00       b    14     14
    2000-01-02 02:00:00       a    15     17
    2000-01-02 02:05:00       b    16     18
    2000-01-02 03:00:00       a    17     17
    2000-01-02 03:05:00       b    18     18
    2000-01-02 04:00:00       a    19     19
    2000-01-02 04:05:00       b    20     20
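
To make the NaN in the question's attempt easier to see, here is a small sketch on a single group, roughly as the lambda inside transform would receive it. The Series s and the names is_2am, three_am, aligned and fixed are made up for illustration, freq="60min" is an assumption to stay independent of the pandas version, and the scalar via .iloc[0] is a swapped-in alternative to the tolist() call used above.

```python
import pandas as pd

# One (day, article) group with hourly timestamps 00:00 .. 04:00.
idx = pd.date_range("2000-01-01 00:00:00", periods=5, freq="60min")
s = pd.Series([1, 3, 5, 7, 9], index=idx, name="view")

is_2am = s.index.hour == 2
three_am = s[s.index.hour == 3]      # one-element Series labelled 03:00

# Passing a Series as `other`: it is aligned on the index first, so the
# 02:00 row has no matching label and ends up as NaN.
aligned = s.where(~is_2am, three_am)

# Passing a plain scalar: there is no index to align, the value is broadcast.
fixed = s.where(~is_2am, three_am.iloc[0])

print(aligned)
print(fixed)
```

The scalar form assumes each group has exactly one 3 am row; an empty group would make .iloc[0] raise an IndexError.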