Search code examples
stata

Stata: How to count the number of 'active' cases in a group when new case is opened?


I'm relatively new to Stata and am trying to count the number of active cases an employee has open over time in my dataset (see link below for example). I tried writing a loop using forvalues based on an example I found online, but keep getting

invalid syntax

For each EmpID I want to count the number of cases that employee had open when a new case was added to the queue. So if a case is added with an OpenDate of 03/15/2015 and the EmpID has two other cases open at the time, the code would assign a value of 2 to NumActiveWhenOpened field. A case is considered active if (1) its OpenDate is less then the new case's OpenDate & (2) its CloseDate is greater than the new case's OpenDate.

The link below provides an example. I'm trying to write a loop that creates the NumActiveWhenOpened column. Any help would be greatly appreciated. Thanks!

https://i.sstatic.net/z4iyR.jpg

EDIT

Here is the code that is not working. I'm sure there are several things wrong with it and I'm not sure how to store the count in the [NumActiveWhenOpen] field.


by EmpID: generate CaseNum = _n

egen group = group(EmpID)   
su group, meanonly

gen NumActiveWhenOpen = 0

  forvalues i = 1/ 'r(max)' {

    forvalues x = 1/CaseNum if group == `i'{

        count if OpenDate[_n] > OpenDate[_n-x] & CloseDate[_n-x] > OpenDate[_n] 
  } 
}

Solution

  • This sounds like a problem discussed in http://www.stata-journal.com/article.html?article=dm0068 but let's try to be self-contained. I am not sure that I understand the definitions, but this may help.

    I'll steal part of Roberto Ferrer's sandbox.

    clear
    set more off
    input ///
    caseid str15(open close) empid 
    1 "1/1/2010" "3/1/2010" 1 
    2 "2/5/2010" "" 1 
    3 "2/15/2010" "4/7/2010" 1 
    4 "3/5/2010" "" 1 
    5 "3/15/2010" "6/15/2010" 1 
    6 "3/24/2010" "3/24/2010" 1 
    1 "1/1/2010" "3/1/2010" 2 
    2 "2/5/2010" "" 2 
    3 "2/15/2010" "4/7/2010" 2 
    4 "3/5/2010" "" 2 
    5 "3/15/2010" "6/15/2010" 2 
    end
    
    gen d1 = date(open, "MDY")
    gen d2 = date(close, "MDY")
    format %td d1 d2
    drop open close
    
    reshape long d, i(empid caseid) j(status) 
    replace status = -1 if status == 2 
    replace status = . if missing(d) 
    bysort empid (d) : gen nopen = sum(status) 
    bysort empid d : replace nopen = nopen[_N] 
    
    l
    

    The idea is to reshape so that each pair of dates becomes two observations. Then if we code each opening by 1 and each closing by -1 the total number of active cases is their cumulative sum. That's all. Here are the results:

    . l, sepby(empid)
    
         +---------------------------------------------+
         | empid   caseid   status           d   nopen |
         |---------------------------------------------|
      1. |     1        1        1   01jan2010       1 |
      2. |     1        2        1   05feb2010       2 |
      3. |     1        3        1   15feb2010       3 |
      4. |     1        1       -1   01mar2010       2 |
      5. |     1        4        1   05mar2010       3 |
      6. |     1        5        1   15mar2010       4 |
      7. |     1        6        1   24mar2010       4 |
      8. |     1        6       -1   24mar2010       4 |
      9. |     1        3       -1   07apr2010       3 |
     10. |     1        5       -1   15jun2010       2 |
     11. |     1        2        .           .       2 |
     12. |     1        4        .           .       2 |
         |---------------------------------------------|
     13. |     2        1        1   01jan2010       1 |
     14. |     2        2        1   05feb2010       2 |
     15. |     2        3        1   15feb2010       3 |
     16. |     2        1       -1   01mar2010       2 |
     17. |     2        4        1   05mar2010       3 |
     18. |     2        5        1   15mar2010       4 |
     19. |     2        3       -1   07apr2010       3 |
     20. |     2        5       -1   15jun2010       2 |
     21. |     2        4        .           .       2 |
     22. |     2        2        .           .       2 |
         +---------------------------------------------+
    

    The bottom line is no loops needed, but by: helps mightily. A detail useful here is that the cumulative sum function sum() ignores missings.