
Create a sequence for dates with repeats


I have a list of days (numbered 195-720) and each day has multiple observations. I would ultimately like to determine which of these days are weekdays and which are weekend days. I would be able to do this if I could just assign the digits 1-7 to each of the days. Currently, the data looks like this:

     Day    Household ID    Hour of Day
     195     1                  1
     195     1                  2
     195     1                  3
     195     1                  4
     196     1                  1
     196     1                  2
     196     1                  3
     197     1                  1
     197     1                  2

It is perhaps important to note that there is not a consistent number of observations for each day (e.g. 4 observations for day 195, 3 observations for day 196, 2 observations for day 197).

I know that Day 195 is a Tuesday, which for simplicity's sake I would like to code as equal to "2" (Wednesday=3, Thursday=4, etc).

Thus, I would like to get the following output:

     Day    Household ID    Hour of Day         DAY OF WEEK
     195     1                  1                  2 
     195     1                  2                  2
     195     1                  3                  2
     195     1                  4                  2
     196     1                  1                  3
     196     1                  2                  3
     196     1                  3                  3
     197     1                  1                  4
     197     1                  2                  4

After looking through Stata documentation, I considered using DYM/DMY. However, this does not work because I do not have an original "date" variable to work from. Instead, I just have a number "195" which corresponds to Tuesday, July 12.

I wanted to use something like:

     bysort day: egen Hour_of_Day = seq(2, 3, 4, 5, 6, 7, 1)

However, Stata tells me that this has a syntax error. Note: I start with "2" because my first day (195) is a Tuesday. I also considered commands like carryforward or mod(x,y) or fill.

Does anyone know how I can set the sequence to fill the same for each day? How can I fix this code to achieve the desired output?


Solution

  • If you know that 195 was Tuesday then the reverse engineering is straightforward. 193 must have been Sunday and 199 Saturday.

    Let's look at a sandbox with that week, 193 to 199. Our first guess at a day of week function of our own will use the mod() function (not a command). This paper is a short riff on its applications in Stata.

    . clear
    
    . set obs 7
    number of observations (_N) was 0, now 7
    
    . gen day = 192 + _n
    
    . gen dow = mod(day, 7)
    
    . list, sep(0)
    
         +-----------+
         | day   dow |
         |-----------|
      1. | 193     4 |
      2. | 194     5 |
      3. | 195     6 |
      4. | 196     0 |
      5. | 197     1 |
      6. | 198     2 |
      7. | 199     3 |
         +-----------+
    

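    Stata's convention for day of week is that 0 is Sunday and 6 is Saturday. Our variable is just a rotation away from that: Sunday (193) has remainder 4, so adding 3 before taking the remainder maps it to 0.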

    . gen DOW = mod(day + 3, 7)
    
    . list, sep(0) 
    
         +-----------------+
         | day   dow   DOW |
         |-----------------|
      1. | 193     4     0 |
      2. | 194     5     1 |
      3. | 195     6     2 |
      4. | 196     0     3 |
      5. | 197     1     4 |
      6. | 198     2     5 |
      7. | 199     3     6 |
         +-----------------+
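
    If you would rather have the digits 1-7, here is a hedged sketch. The sequence attempted in the question (2, ..., 7, 1) suggests Monday = 1 through Sunday = 7; that reading is an assumption, and the variable name DOW17 is made up for illustration. One more rotation gives that coding:

    clear
    set obs 7
    gen day = 192 + _n
    * Stata-style coding: Sunday = 0, ..., Saturday = 6
    gen DOW = mod(day + 3, 7)
    * assumed 1-7 coding: Monday = 1, ..., Sunday = 7
    gen DOW17 = mod(day + 2, 7) + 1
    * the two codings should differ only on Sundays (0 versus 7)
    assert DOW17 == cond(DOW == 0, 7, DOW)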
    

    You can check with Stata's own dow() function that another way to get DOW above is

    gen StataDOW = dow(day - 2)
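
    A minimal sketch of that check, rebuilding the sandbox so that it runs on its own. Stata daily dates count days from 1 January 1960, which was a Friday, so dow(0) is 5 and dow(day - 2) works out to mod(day + 3, 7). assert stays silent when the condition holds in every observation and stops with an error otherwise.

    clear
    set obs 7
    gen day = 192 + _n
    gen DOW = mod(day + 3, 7)
    * compare our own rotation with Stata's dow() function
    assert DOW == dow(day - 2)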
    

    So an indicator for weekday is (for example)

    gen weekday = !inlist(DOW, 0, 6) 
    

    or

    gen weekday = inrange(DOW, 1, 5) 
    

    or

    gen weekday = !inlist(dow, 4, 3) 
    

    using the first variable created.

    As it happens, I originally wrote egen, seq(). Your syntax is indeed not legal: seq() is the correct form, but nothing is ever placed within the parentheses, and its arguments are supplied through options such as from(), to() and block(). I wouldn't use egen here, if only because the right answers are essentially impossible with multiple occurrences of each day (as you do have) and also unlikely if you have gaps in the data. The reasoning here, by contrast, is, or should be, robust to repetitions and gaps, because it depends only on the value of day in each observation.
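
    As a sketch of that robustness, here is the same calculation on data shaped like the example in the question. The variable names household and hour are my own choices, and the final observation for day 200 is hypothetical, added only to create a gap.

    clear
    input day household hour
    195 1 1
    195 1 2
    195 1 3
    195 1 4
    196 1 1
    196 1 2
    196 1 3
    197 1 1
    197 1 2
    200 1 1
    end
    * each observation is coded from its own value of day,
    * so repeated days and missing days make no difference
    gen DOW = mod(day + 3, 7)
    gen weekday = inrange(DOW, 1, 5)
    list, sepby(day)

    Days 195, 196 and 197 should come out as 2, 3 and 4 however many observations each has, and day 200, a Sunday, as 0 with weekday equal to 0.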