Search code examples
stata

How to generate indicator if value of variable is observed in two different periods in Stata


I have a dataset containing various drugs and the dates they were supplied. I would like to create an indicator variable DIBP that takes a value of 1 if the same drug was supplied during both period 1 and period 2 of a given year, and zero otherwise. Period 1 is 1 April to 30 June, and period 2 is 1 October to 31 December.

I have written the following code:

. input id month day year str10 drug 

            id      month        day       year        drug
  1. 1 5  1 2003 aspirin
  2. 1 11 1 2003 aspirin
  3. 1 6  1 2004 aspirin
  4. 1 5  1 2005 aspirin
  5. 1 11 1 2005 aspirin
  6. end

. 
. gen date = mdy(month,day,year)

. format date %d

. 
. gen     period = 1 if inlist(month,4,5,6)
(2 missing values generated)

. replace period = 2 if inlist(month,10,11,12)
(2 real changes made)

. 
. label define plab 1"1 April to 30 June" 2"1 October to 31 December"

. label value period plab

. 
. * Generate indicator
. gen DIBP = 0

. label var DIBP "Drug In Both Periods"

. 
. bysort id year: replace DIBP = 1 if drug[period==1] == "aspirin" & drug[period==2] == "aspirin"
(0 real changes made)

. 
. list

     +---------------------------------------------------------------------------------+
     | id   month   day   year      drug        date                     period   DIBP |
     |---------------------------------------------------------------------------------|
  1. |  1       5     1   2003   aspirin   01may2003         1 April to 30 June      0 |
  2. |  1      11     1   2003   aspirin   01nov2003   1 October to 31 December      0 |
  3. |  1       6     1   2004   aspirin   01jun2004         1 April to 30 June      0 |
  4. |  1       5     1   2005   aspirin   01may2005         1 April to 30 June      0 |
  5. |  1      11     1   2005   aspirin   01nov2005   1 October to 31 December      0 |
     +---------------------------------------------------------------------------------+

I would expect DIBP to take a value of 1 for observations 1,2,3 and 4 (because they took aspirin during both periods for years 2003 and 2005) and a value of zero for observation 3 (because aspirin was only taken during one period in 2004), but this isn't the case. Where am I going wrong? Thank you.


Solution

  • There is a problem apparent with your use of subscripts. You seem to be assuming that a subscript can be used to select other observations, which can indeed be done individually. But what you tried is legal yet not what you want.

    The expressions used as subscripts

    period == 1 
    
    period == 2 
    

    will be evaluated as true (1) or false (0) according to the value of period in the current observation. Then either observation 0 (which is always regarded as having missing values) or observation 1 (the first in each group of observations) will be used. Otherwise put, subscripts evaluate as observation numbers, not as defining subsets of the data.

    There is a further puzzle because even for the same person and year it seems that in principle period 1 or period 2 could mean several observations. In the example given, the drug is constant any way, but what would you expect the code to do if the drug was different? The crux most evident to me is distinguishing between a flag for any prescriptions of a certain drug and all prescriptions of that drug in a period. More at this FAQ.

    Otherwise this code may help. Extension to several drugs is left as an exercise.

    clear 
    input id month day year str10 drug 
    1 5  1 2003 aspirin
    1 11 1 2003 aspirin
    1 6  1 2004 aspirin
    1 5  1 2005 aspirin
    1 11 1 2005 aspirin
    end
    
    generate date = mdy(month,day,year)
    format date %td
    
    * code needs modification if any month is 1, 2, 3, 7, 8, 9 
    
    generate period = 1 if inlist(month,4,5,6)
    replace period = 2 if inlist(month,10,11,12)
    label define plab 1"1 April to 30 June" 2"1 October to 31 December"
    label value period plab
    
    bysort id year period (date): egen all_aspirin = min(drug == "aspirin") 
    by id year period: egen any_aspirin = max(drug == "aspirin") 
    by id year : gen both_all_aspirin = period[1] == 1 & period[_N] == 2 & all_aspirin[1] & all_aspirin[_N]
    by id year : gen both_any_aspirin = period[1] == 1 & period[_N] == 2 & any_aspirin[1] & any_aspirin[_N]
    
    list id date drug *aspirin 
    
         +----------------------------------------------------------------------+
         | id        date      drug   all_as~n   any_as~n   b~ll_a~n   b~ny_a~n |
         |----------------------------------------------------------------------|
      1. |  1   01may2003   aspirin          1          1          1          1 |
      2. |  1   01nov2003   aspirin          1          1          1          1 |
      3. |  1   01jun2004   aspirin          1          1          0          0 |
      4. |  1   01may2005   aspirin          1          1          1          1 |
      5. |  1   01nov2005   aspirin          1          1          1          1 |
         +----------------------------------------------------------------------+
    

    As a style note, consider this example

     generate dummy = 0 
     replace dummy = 1 if frog == 42 
    

    Experienced Stata programmers generally just write

     generate dummy = frog == 42 
    

    See also this FAQ