Search code examples
rdataframelabelgroupingsequential

Trying to label sequentially within groups of dataframe R


I have a subset of my dataframe:

df = data.frame(retailer_id = c(1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2),
                store_id = c(166, 166, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167),
                quad_id = c(2017010104, 2017012904, 2017010104, 2017012904, 2017022604, 2017032604 ,2017042304, 2017052104, 2017061804, 
                            2017071604, 2017081304, 2017091004, 2017100804, 2017110504, 2017120304, 2017123104, 2018012804, 2018022504, 2018032504, 2018042204))

where 2017010104 corresponds to the date 01/01/2017 and so on. I am trying to label these different quad_ids sequentially with reference to the year. So for example I am trying to get the output:

df = data.frame(retailer_id = c(1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2),
                store_id = c(166, 166, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167),
                quad_id = c(2017010104, 2017012904, 2017010104, 2017012904, 2017022604, 2017032604 ,2017042304, 2017052104, 2017061804, 
                            2017071604, 2017081304, 2017091004, 2017100804, 2017110504, 2017120304, 2017123104, 2018012804, 2018022504, 2018032504, 2018042204),
                Snum = c(1,  2,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 1, 2, 3, 4))

where you can see for retailer_id = 2, store_id = 167, the weeks for the year 2017 are labeled 1-14 and then when the week begins with 2018 it starts counting sequentially from 1 again until it will reach a week that starts with 2019 within this grouping.

I tried:

DT <- data.table(df)

DT[, Snum := seq_len(.N), by = list(retailer_id, store_id)]

However, this is not labeling sequentially by year, instead it is labelling sequentially by store_id. Is there a way to fix this? (this example code is only showing two retailers and two stores, whereas my actual dataframe and hundreds of different retailers and stores)


Solution

  • Here's a solution using tidyverse

    df = data.frame(retailer_id = c(1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2),
                    store_id = c(166, 166, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167),
                    quad_id = c(2017010104, 2017012904, 2017010104, 2017012904, 2017022604, 2017032604 ,2017042304, 2017052104, 2017061804, 
                                2017071604, 2017081304, 2017091004, 2017100804, 2017110504, 2017120304, 2017123104, 2018012804, 2018022504, 2018032504, 2018042204))
    
    library(tidyverse)
    
    getYear = function(x) {
      x %>% 
        str_extract("^\\d{4}") %>% 
        as.integer() %>% 
        return()
    }
    
    tmp = df %>% 
      mutate(year = getYear(quad_id)) %>% 
      group_by(year, retailer_id, store_id) %>% 
      mutate(Snum = 1:n())
    
    > tmp
    # A tibble: 20 x 5
    # Groups:   year, retailer_id, store_id [3]
       retailer_id store_id    quad_id  year  Snum
             <dbl>    <dbl>      <dbl> <int> <int>
     1           1      166 2017010104  2017     1
     2           1      166 2017012904  2017     2
     3           2      167 2017010104  2017     1
     4           2      167 2017012904  2017     2
     5           2      167 2017022604  2017     3
     6           2      167 2017032604  2017     4
     7           2      167 2017042304  2017     5
     8           2      167 2017052104  2017     6
     9           2      167 2017061804  2017     7
    10           2      167 2017071604  2017     8
    11           2      167 2017081304  2017     9
    12           2      167 2017091004  2017    10
    13           2      167 2017100804  2017    11
    14           2      167 2017110504  2017    12
    15           2      167 2017120304  2017    13
    16           2      167 2017123104  2017    14
    17           2      167 2018012804  2018     1
    18           2      167 2018022504  2018     2
    19           2      167 2018032504  2018     3
    20           2      167 2018042204  2018     4
    

    Note that if your data isn't sorted by retailer_id, store_id and year that would cause an issue.