Search code examples
variablesstataindicator

How to find if observation within group has certain values


Say that I have this MWE:

clear all
input str2 person enr_year enr_term
"a" 2000 1     
"a" 2000 2   
"a" 2000 2 
"a" 2000 3   
"a" 2000 3 
"a" 2001 1   
"a" 2001 2   
"a" 2001 3   
"a" 2002 2
"a" 2002 2    
"a" 2003 2      
"a" 2006 1
"a" 2006 2
"a" 2008 2  
"b" 2000 2
"b" 2001 3
end

label define term 1 "Summer" 2 "Fall" 3 "Spring"
label values enr_term term

Some explanation is in order. This is school enrollment data. person is an person, and everything needs to be done within person.

enr_year is an academic year. enr_term is an academic term. The reason that Summer and Fall come before Spring is because the year is academic year, not calendar year.

Each line in the data implicitly means that the person enrolled in the given year and term.

My task is to create two indicator variables: enr_this_spring and enr_next_fall. I can successfully get enr_this_spring. I have included my code to do so in case that logic would be helpful to figure out how to get enr_next_fall.

*These indicator variables should only be created for the observations with fall enrollment.

enr_this_spring means that the person enrolled the following spring. Because we only make this variable for the fall semester, this will be 1 if there is a spring observation in the same year. It will be 0 otherwise, even if there is a spring observation in the next year.

enr_next_fall will be 1 if there is a fall observation from the next year. As will be described below, the problem I am not sure how to overcome arises if there is a situation in which the student enrolls in the fall of x, not in the fall of x+1, but again the fall of x+n, where n>1.

If there are two fall observations within the same year (multiple enrollment periods, perhaps the student was enrolled at two schools at the same time), they will both take on the same values.

Here is what I want to get:

clear all
input str2 person enr_year enr_term enr_this_spring enr_next_fall
"a" 2000 1  .   . // missing because not Fall    
"a" 2000 2  1   1 // 1 b/c a/2000/3; 1 b/c a/2001/2   
"a" 2000 2  1   1 // same reasons as line directly above 
"a" 2000 3  .   . // missing because not Fall  
"a" 2000 3  .   . // missing because not Fall 
"a" 2001 1  .   . // missing because not Fall   
"a" 2001 2  1   1 // 1 b/c a/2001/3; 1 b/c a/2002/2   
"a" 2001 3  .   . // missing because not Fall   
"a" 2002 2  0   1 // 0 b/c no a/2002/3; 1 b/c a/2003/2  
"a" 2002 2  0   1 // same reasons as line directly above    
"a" 2003 2  0   0 // 0 b/c no a/2003/2; 0 b/c no a/2004/2       
"a" 2006 1  .   . // missing because not Fall  
"a" 2006 2  0   0 // 0 b/c no a/2006/3; 0 b/c no a/2007/2  
"a" 2008 2  0   0 // 0 b/c no a/2008/3; 0 b/c no a/2009/2  
"b" 2000 2  0   0 // 0 b/c no a/2000/3; 0 b/c no a/2001/2
"b" 2001 3  .   . // missing because not Fall
end
label define term 1 "Summer" 2 "Fall" 3 "Spring"
label values enr_term term

Starting with the original data, I first can successfully get enr_this_spring as follows:

*Create indicators for if the term is spring and if term is fall
gen is_spring = enr_term == 3
gen is_fall = enr_term ==2
*Get the maximum value, within person and year
bys person enr_year: egen enr_this_spring = max(is_spring)
replace enr_this_spring=. if is_fall!=1

I am not sure how to create an indicator for if ther person enrolled next fall.

Here is what I have tried, with an explanation for why it doesn't work following the code:

*Preserve the data. We are going to process it and merge back on
preserve
*We only are concerned about fall attendance for this part
keep if enr_term==2
*We only want one observation per term, as duplicates mess up the code
bys person enr_year enr_term: keep if _n==1 
*Make a variable that is a constant 1
gen one = 1
*Make a variable, enr_next_fall that is 1 if the person enrolled in the fall
* in the following observation. Note that we do this within group and sort
* by enr_year
bys person (enr_year): gen enr_next_fall = one[_n+1]
* Replace missing with 0. This only affects the final observation within group
replace enr_next_fall = 0 if missing(enr_next_fall)
*Create temporary file, to be merged on
tempfile a
save `a'
restore
*Merge on the temporary file
merge m:1 person enr_year enr_term using `a'
drop is_spring is_fall one _merge

This does not get me what I want in the cases where the person did not enroll in the fall of the next year, but came back. Maybe they got sick and missed an entire school year. How should I fix this?


Solution

  • I think I have figured it out:

    clear all
    input str2 person enr_year enr_term
    "a" 2000 1     
    "a" 2000 2   
    "a" 2000 2 
    "a" 2000 3   
    "a" 2000 3 
    "a" 2001 1   
    "a" 2001 2   
    "a" 2001 3   
    "a" 2002 2
    "a" 2002 2    
    "a" 2003 2      
    "a" 2006 1
    "a" 2006 2
    "a" 2008 2  
    "b" 2000 2
    "b" 2001 3
    end
    
    label define term 1 "Summer" 2 "Fall" 3 "Spring"
    label values enr_term term
    
    *Create indicators for if the term is spring and if term is fall
    gen is_spring = enr_term == 3
    gen is_fall = enr_term ==2
    *Get the maximum value, within person and year
    bys person enr_year: egen enr_this_spring = max(is_spring)
    replace enr_this_spring=. if is_fall!=1
    
    *Create enr_next_fall variable. Merge back on
    preserve
    keep if enr_term==2
    bys person enr_year: keep if _n==1
    bys person (enr_year): gen next = enr_year[_n+1]
    replace next = next - 1
    gen enr_next_fall = enr_year==next
    drop next
    tempfile fall
    save `fall'
    restore
    merge m:1 person enr_year using `fall'
    drop _merge
    replace enr_next_fall = . if enr_term!=2