Say that I have this MWE:
clear all
input str2 person enr_year enr_term
"a" 2000 1
"a" 2000 2
"a" 2000 2
"a" 2000 3
"a" 2000 3
"a" 2001 1
"a" 2001 2
"a" 2001 3
"a" 2002 2
"a" 2002 2
"a" 2003 2
"a" 2006 1
"a" 2006 2
"a" 2008 2
"b" 2000 2
"b" 2001 3
end
label define term 1 "Summer" 2 "Fall" 3 "Spring"
label values enr_term term
Some explanation is in order. This is school enrollment data. person
is an person, and everything needs to be done within person.
enr_year
is an academic year. enr_term
is an academic term. The reason that Summer and Fall come before Spring is because the year is academic year, not calendar year.
Each line in the data implicitly means that the person enrolled in the given year and term.
My task is to create two indicator variables: enr_this_spring
and enr_next_fall
. I can successfully get enr_this_spring
. I have included my code to do so in case that logic would be helpful to figure out how to get enr_next_fall
.
*These indicator variables should only be created for the observations with fall enrollment.
enr_this_spring
means that the person enrolled the following spring. Because we only make this variable for the fall semester, this will be 1 if there is a spring observation in the same year. It will be 0 otherwise, even if there is a spring observation in the next year.
enr_next_fall
will be 1 if there is a fall observation from the next year. As will be described below, the problem I am not sure how to overcome arises if there is a situation in which the student enrolls in the fall of x, not in the fall of x+1, but again the fall of x+n,
where n>1.
If there are two fall observations within the same year (multiple enrollment periods, perhaps the student was enrolled at two schools at the same time), they will both take on the same values.
Here is what I want to get:
clear all
input str2 person enr_year enr_term enr_this_spring enr_next_fall
"a" 2000 1 . . // missing because not Fall
"a" 2000 2 1 1 // 1 b/c a/2000/3; 1 b/c a/2001/2
"a" 2000 2 1 1 // same reasons as line directly above
"a" 2000 3 . . // missing because not Fall
"a" 2000 3 . . // missing because not Fall
"a" 2001 1 . . // missing because not Fall
"a" 2001 2 1 1 // 1 b/c a/2001/3; 1 b/c a/2002/2
"a" 2001 3 . . // missing because not Fall
"a" 2002 2 0 1 // 0 b/c no a/2002/3; 1 b/c a/2003/2
"a" 2002 2 0 1 // same reasons as line directly above
"a" 2003 2 0 0 // 0 b/c no a/2003/2; 0 b/c no a/2004/2
"a" 2006 1 . . // missing because not Fall
"a" 2006 2 0 0 // 0 b/c no a/2006/3; 0 b/c no a/2007/2
"a" 2008 2 0 0 // 0 b/c no a/2008/3; 0 b/c no a/2009/2
"b" 2000 2 0 0 // 0 b/c no a/2000/3; 0 b/c no a/2001/2
"b" 2001 3 . . // missing because not Fall
end
label define term 1 "Summer" 2 "Fall" 3 "Spring"
label values enr_term term
Starting with the original data, I first can successfully get enr_this_spring as follows:
*Create indicators for if the term is spring and if term is fall
gen is_spring = enr_term == 3
gen is_fall = enr_term ==2
*Get the maximum value, within person and year
bys person enr_year: egen enr_this_spring = max(is_spring)
replace enr_this_spring=. if is_fall!=1
I am not sure how to create an indicator for if ther person enrolled next fall.
Here is what I have tried, with an explanation for why it doesn't work following the code:
*Preserve the data. We are going to process it and merge back on
preserve
*We only are concerned about fall attendance for this part
keep if enr_term==2
*We only want one observation per term, as duplicates mess up the code
bys person enr_year enr_term: keep if _n==1
*Make a variable that is a constant 1
gen one = 1
*Make a variable, enr_next_fall that is 1 if the person enrolled in the fall
* in the following observation. Note that we do this within group and sort
* by enr_year
bys person (enr_year): gen enr_next_fall = one[_n+1]
* Replace missing with 0. This only affects the final observation within group
replace enr_next_fall = 0 if missing(enr_next_fall)
*Create temporary file, to be merged on
tempfile a
save `a'
restore
*Merge on the temporary file
merge m:1 person enr_year enr_term using `a'
drop is_spring is_fall one _merge
This does not get me what I want in the cases where the person did not enroll in the fall of the next year, but came back. Maybe they got sick and missed an entire school year. How should I fix this?
I think I have figured it out:
clear all
input str2 person enr_year enr_term
"a" 2000 1
"a" 2000 2
"a" 2000 2
"a" 2000 3
"a" 2000 3
"a" 2001 1
"a" 2001 2
"a" 2001 3
"a" 2002 2
"a" 2002 2
"a" 2003 2
"a" 2006 1
"a" 2006 2
"a" 2008 2
"b" 2000 2
"b" 2001 3
end
label define term 1 "Summer" 2 "Fall" 3 "Spring"
label values enr_term term
*Create indicators for if the term is spring and if term is fall
gen is_spring = enr_term == 3
gen is_fall = enr_term ==2
*Get the maximum value, within person and year
bys person enr_year: egen enr_this_spring = max(is_spring)
replace enr_this_spring=. if is_fall!=1
*Create enr_next_fall variable. Merge back on
preserve
keep if enr_term==2
bys person enr_year: keep if _n==1
bys person (enr_year): gen next = enr_year[_n+1]
replace next = next - 1
gen enr_next_fall = enr_year==next
drop next
tempfile fall
save `fall'
restore
merge m:1 person enr_year using `fall'
drop _merge
replace enr_next_fall = . if enr_term!=2