I'm working with a panel data set with the following variables. Here's a snippet of my data:
i region urban year
8431 3 1 1979
8431 3 1 1980
8431 3 1 1981
8431 3 1 1982
8431 3 0 1983
8431 3 0 1984
8431 3 0 1985
8431 3 0 1986
8431 3 0 1987
8431 3 0 1988
8431 3 0 1989
8431 3 1 1990
8431 3 1 1991
8431 3 1 1992
8431 3 1 1993
8431 3 1 1994
8431 3 1 1996
8431 3 0 1998
8431 3 1 2000
8431 3 1 2002
8431 3 1 2004
8431 3 1 2006
8431 2 0 2008
8431 3 1 2010
8431 3 1 2012
The panel is strongly balanced, but there are some missing observations in the data that I deal with through interpolation.
I want to calculate the total number of times region
and urban
have changed during the observation period for each individual.
First thing I tried was lagged values:
bysort i: gen urban_lag = l1.urban
bysort i (year): gen urbanchange = 0 if urban==urban_lag & !missing(urban)
replace urbanchange = 1 if urban!=urban_lag & !missing(urban)
For some reason, this generates a lot of missing values, even for individuals where there are not missing values for region
. Does anyone know why this is the case?
I then tried this:
sort i year
by i: gen byte urbanchange = urban != urban[_n-1]
This mostly works and gets me slightly closer to where I want, except the first observation of each individual is equal to 1, because change[0]
is before the start of the data and so Stata returns it as missing, so I get urbanchange[1] = 1
for every individual.
This is the output I want:
i region urban year regionchange urbanchange
8431 3 1 1979 0 0
8431 3 1 1980 0 0
8431 3 1 1981 0 0
8431 3 1 1982 0 0
8431 3 0 1983 0 1
8431 3 0 1984 0 0
8431 3 0 1985 0 0
8431 3 0 1986 0 0
8431 3 0 1987 0 0
8431 3 0 1988 0 0
8431 3 0 1989 0 0
8431 3 1 1990 0 1
8431 3 1 1991 0 0
8431 3 1 1992 0 0
8431 3 1 1993 0 0
8431 3 1 1994 0 0
8431 3 1 1996 0 0
8431 3 0 1998 0 1
8431 3 1 2000 0 1
8431 3 1 2002 0 0
8431 3 1 2004 0 0
8431 3 1 2006 0 0
8431 2 0 2008 1 1
8431 3 1 2010 1 1
8431 3 1 2012 0 0
This question is basically the Stata version of this question I asked a year ago Counting the number of changes of a categorical variable during repeated measurements within a category
Edit regarding interpolation: since there are a lot of gaps in my data, for observations with missing values, I make the assumption that, for example, the individual's current urban value is the same as the last observed (non-missing) urban value.
For example:
region urban year
1 1 1979
1 1 1980
1 1 1981
3 1 1982
3 1 1983
3 1 1984
. . 1985
1 1 1986
1 1 1987
1 1 1988
1 1 1989
. . 1990
. . 1991
. . 1992
. . 1993
. . 1994
I fill in the missing values like such:
gen urban1 = urban
bysort i (year): replace urban1 = urban1[_n-1] if missing(urban1)
gsort i -year
replace urban1 = urban1[_n-1] if urban1 >=.
and do the same for region.
Output:
region urban year region1 urban1
1 1 1979 1 1
1 1 1980 1 1
1 1 1981 1 1
3 1 1982 3 1
3 1 1983 3 1
3 1 1984 3 1
. . 1985 3 1
1 1 1986 1 1
1 1 1987 1 1
1 1 1988 1 1
1 1 1989 1 1
. . 1990 1 1
. . 1991 1 1
. . 1992 1 1
. . 1993 1 1
. . 1994 1 1
I'm making some strong assumptions here and some may disagree with what I'm doing here, but for the purpose of counting the number of changes, it doesn't affect the results I want and the missing values don't mess with my output.
This answer focuses on the question, which is to count the number of changes in a panel. Your desired output is only part-way there as it consists of indicators with 1 for change and 0 for same.
Consider this code with your data example:
clear
input i region urban year
8431 3 1 1979
8431 3 1 1980
8431 3 1 1981
8431 3 1 1982
8431 3 0 1983
8431 3 0 1984
8431 3 0 1985
8431 3 0 1986
8431 3 0 1987
8431 3 0 1988
8431 3 0 1989
8431 3 1 1990
8431 3 1 1991
8431 3 1 1992
8431 3 1 1993
8431 3 1 1994
8431 3 1 1996
8431 3 0 1998
8431 3 1 2000
8431 3 1 2002
8431 3 1 2004
8431 3 1 2006
8431 2 0 2008
8431 3 1 2010
8431 3 1 2012
end
bysort i (year) : gen change1 = sum(region != region[_n-1])
by i : replace change1 = change1[_N]
by i : gen change2 = sum(urban != urban[_n-1])
by i : replace change2 = change2[_N]
tabdisp i, c(change?)
----------------------------------
i | change1 change2
----------+-----------------------
8431 | 3 7
----------------------------------
sum()
calculates the running or cumulative sum, and looking at the last value in each panel gives you the number of spells or runs on a variable, each spell or run being defined by the variable in question being constant.
It's important to think about what happens in the first observation of each panel. That is observation 1. As you say, Stata is happy with a reference, implicit or explicit, to values in the previous observation 0 but always returns a missing value. So in the data example region[1]
is 1, region[0]
is returned as missing, and region[1]
is therefore not equal to region[0]
. Hence even if a variable were entirely constant in a panel each count variable produced by this code would be 1. If you want only changes within a panel, subtract 1 in the last statement.
For more on the principles of identifying spells, see this column.
I used subscripting rather than time series operators because of gaps in your panel. I can't follow your mention of interpolation sufficiently to work out what the data would look like after interpolation and in any case the data example contains gaps and that is what people answering can see.
So, why did you get missing values? First, your code would produce missing values in the first observation of each panel unless the first value was also missing. Second, gaps would cause missing values with time series operators.
Your first block of code
bysort i: gen urban_lag=l1.urban
bysort i (year): gen urbanchange = 0 if urban==urban_lag & !missing(urban)
replace urbanchange = 1 if urban!=urban_lag & !missing(urban)
depends on a previous
tsset i year
but given that you did that earlier, the block almost reduces to
gen urbanchange = urban != l1.urban & !missing(urban)
It's perfectly legal to use a by:
prefix with the panel identifier and also to use time series operators, but not necessary. Part of the white magic of time series operators is that where separate panels exist, separate calculations are guaranteed.