Search code examples
rdata.tabledummy-variable

How to add seasonal dummy variables?


I would like to add seasonality dummies in my R data.table based on quarters. I have looked at multiple examples but I haven't been able to solve this issue yet. My knowledge about R is limited so I was wondering if you could get me on the right track.

My data.table looks like this:

    Year_week  artist_id  number_of_events number_of_streams
   1:     16/30    8296         1            957892
   2:     16/33    8296         6            882282
   3:     16/34    8296         5            926037
   4:     16/35    8296         2            952704
   5:     15/37    17879        1             89515
   6:     16/22    22690        2            119653

What I would like to have is a format like this:

 Year_week  artist_id  number_of_events number_of_streams Q2 Q3 Q4
   1:     16/50    8296         1            957892        0  0  1       

Solution

  • Two approaches:

    1) Using dcast, cut and sub:

    dcast(DT[, Q := cut(as.integer(sub('.*/','',Year_week)),
                        breaks = c(0,13,26,39,53),
                        labels = paste0('Q',1:4))],
          Year_week + artist_id + number_of_events + number_of_streams ~ Q,
          value.var = 'Q',
          drop = c(TRUE,FALSE),
          fun = length)
    

    gives:

       Year_week artist_id number_of_events number_of_streams Q1 Q2 Q3 Q4
    1:     15/37     17879                1             89515  0  0  1  0
    2:     16/22     22690                2            119653  0  1  0  0
    3:     16/30      8296                1            957892  0  0  1  0
    4:     16/33      8296                6            882282  0  0  1  0
    5:     16/34      8296                5            926037  0  0  1  0
    6:     16/35      8296                2            952704  0  0  1  0
    

    What this does:

    • as.integer(sub('.*/','',Year_week)) extacts the weeknumber from the Year_week column
    • Using cut you divide that into quarters with the appropriate labels (see also ?cut)
    • With dcast you transpose the quarter column to wide format with a aggregation function (length). By using drop = c(TRUE,FALSE) in the dcast function, you make sure that all quarters are included.

    Notes:

    • The Q-column is an ordered factor, so you can use that to arrange and filter your data as well.
    • Depending on the use of the dummy columns: you don't always need those. When you want to use them as grouping or filtering variables, you can just work with the Q variable.
    • However, some statistical tests require dummy variables (which justifies the dcast step).

    2) Using cut, sub and lapply:

    DT[, Q := cut(as.integer(sub('.*/','',Year_week)),
                  breaks = c(0,13,26,39,53),
                  labels = paste0('Q',1:4))
       ][, paste0('Q',1:4) := lapply(paste0('Q',1:4), function(q) as.integer(q == Q))][]
    

    which gives a similar result. Instead of transposing with dcast, you just check whether one of the quarter labels is in the Q-column.


    Used data:

    DT <- fread(' Year_week  artist_id  number_of_events number_of_streams
         16/30    8296         1            957892
         16/33    8296         6            882282
         16/34    8296         5            926037
         16/35    8296         2            952704
         15/37    17879        1             89515
         16/22    22690        2            119653')