loops stata panel-data data-management

How to compare and select non-changing variables in panel data


I have unbalanced panel data and need to exclude each observation (year t) in which a person's income differs from their income in the previous year (t-1), while keeping that person's other observations. In other words, if income changes in year t, then year t should be dropped for that person.

clear
input year id income
2003 513 1500
2003 517 1600
2003 518 1400
2004 513 1500
2004 517 1600
2004 518 1400
2005 517 1600
2005 513 1700
2005 518 1400
2006 513 1700
2006 517 1800
2006 518 1400
2007 513 1700
2007 517 1600
2007 518 1400
2008 513 1700
2008 517 1600
2008 518 1400
end

xtset id year
xtline income, overlay

To illustrate what's going on, I added an xtline plot, which tracks each person's income over the years. ID=518 is the perfect non-changing case (keep all observations). ID=513 has a one-time jump (drop 2005 for that person). ID=517 has something like a peak, perhaps a one-time measurement error (drop 2006 and 2007).


I think this calls for some form of loop: initialize with each person's first value (which cannot be compared), say t0; then compare t1 with t0 and drop t1 if income changed, otherwise compare t2 with t1, and so on. Because the data are unbalanced, some person-year observations may be missing. Thanks for any advice.

Update/Goal: The purpose is to prepare the data for a fixed effects regression analysis. Another variable is reported for the entire "last year", whereas income is reported as of the interview date (a point in time). I need something close to "last year's income" to relate it to that variable. The procedure is suggested and followed in several publications; I am trying to replicate and understand it.

Solution:

bysort id (year) : drop if income != income[_n-1] & _n > 1
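
One caveat, since the panel is unbalanced: income[_n-1] refers to the previous observed row for that person, which may be more than one year back. Below is a minimal sketch of the difference, assuming the data are xtset by id and year. The four-row panel with a missing 2005 is made up for illustration, and the names changed_row and changed_lag are hypothetical. Whether an observation that follows a gap should be kept or dropped is a judgment call that the question does not settle.

```stata
* Hypothetical mini-panel: person 513 has no observation for 2005
clear
input year id income
2003 513 1500
2004 513 1500
2006 513 1700
2007 513 1700
end

xtset id year

* Row-based: compares with the previous observed row (2006 against 2004)
bysort id (year) : gen byte changed_row = income != income[_n-1] & _n > 1

* Year-based: compares with year t-1 only; L.income is missing after a gap,
* so the observation following a gap is NOT flagged here (an assumption)
gen byte changed_lag = income != L.income & !missing(L.income)

list, sepby(id)
```

With the row-based rule, 2006 would be flagged and dropped; with the lag operator it is kept, because there is no 2005 value to compare against. To drop on the year-based rule instead, use drop if changed_lag == 1.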

Solution

  • bysort id (year) : gen byte flag = (income != income[_n-1]) if _n > 1
    list, sepby(id)
    

    The procedure is VERY IFFY methodologically. Nothing needs to be prepared for a fixed effects analysis beyond xtsetting the data, and there is rarely any excuse for creating missing data, let alone doing so to squeeze the data into the limits of what (other) researchers know about statistics and econometrics. I understand that this is a replication study, but whatever you do with your replication and wherever you present it, you need to point out that the original authors did not have much clue about regression to begin with. Don't try too hard to understand it.
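
If you do go ahead and drop on the flag created above, here is a small sketch of one way to do it, using the example data from the question. Note that flag is missing for each person's first year, and Stata treats missing as true in a bare condition, so drop if flag would also remove every first observation. Testing flag == 1 explicitly avoids that.

```stata
clear
input year id income
2003 513 1500
2003 517 1600
2003 518 1400
2004 513 1500
2004 517 1600
2004 518 1400
2005 517 1600
2005 513 1700
2005 518 1400
2006 513 1700
2006 517 1800
2006 518 1400
2007 513 1700
2007 517 1600
2007 518 1400
2008 513 1700
2008 517 1600
2008 518 1400
end

* Flag a change relative to the previous observed row; missing for the first year
bysort id (year) : gen byte flag = (income != income[_n-1]) if _n > 1

* Inspect the flagged rows before deleting anything
list if flag == 1, sepby(id)

* Drop only rows explicitly flagged as changed; missing flags are kept
drop if flag == 1
list, sepby(id)
```

Keeping the flag as a separate step lets you check which person-years are affected before deleting anything.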