stata, panel-data

Keeping only the cases with non-zero observations in a fixed effects regression


I'm currently working with SOEP panel data (years 2002 through 2010) and I've run into a bit of trouble. I'm trying to run a fixed effects regression with child allowance as the dependent variable and person year as the independent variable. I would like the sample to include only individuals who have given their child an allowance in at least one year. I assume that individuals who did not give their child an allowance at any point during this time span are automatically dropped, since fixed effects regression uses within variation and doesn't include individuals with no variation in a given variable (think cohort). So my question is: does xtreg dvar ivar1 ivar2 ivarx, fe vce(cluster id) automatically drop individuals with no variation in a given variable, or do I have to drop them from the regression manually? If so, how would I drop those individuals?

Edit:

I would only like to drop individuals who do not provide their children any allowance during the observation period, while including those who do at least once. I ended up using the following code:

*Generate variable for cases with at least one non-zero value over the observation period 
bysort id: egen sumofallowance = total(childallowance)

*Run a fixed effects regression using this variable as a condition.
xtreg childallowance idyear if sumofallowance>0, fe vce(cluster id)

I was unsure whether Stata would automatically do this, and it turns out Stata does not, as I have about 9,000 fewer observations with the condition in place.
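
The same comparison can be reproduced on a tiny made-up panel. This is only a sketch: the data values are invented, year stands in for idyear, and vce(cluster id) is left out because three persons are too few clusters to be meaningful. Person 3 never pays an allowance, yet the unrestricted xtreg still counts that person's rows.

```stata
* Toy panel (invented values): person 3 never pays an allowance
clear
input id year childallowance
1 2002 0
1 2003 10
1 2004 20
2 2002 5
2 2003 0
2 2004 15
3 2002 0
3 2003 0
3 2004 0
end
xtset id year

* Unrestricted regression: person 3 stays in the estimation sample
xtreg childallowance year, fe
scalar n_all = e(N)
display "unrestricted: " e(N) " observations, " e(N_g) " persons"

* Flag persons with at least one positive allowance, then restrict
bysort id: egen sumofallowance = total(childallowance)
xtreg childallowance year if sumofallowance > 0, fe
display "restricted:   " e(N) " observations, " e(N_g) " persons"
```

The difference between the two observation counts is the number of person-years belonging to individuals who never paid an allowance.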


Solution

  • If you know which sample you want to run your regression on, it is always a good idea to restrict the regression to that sample explicitly, especially if you do not have a precise understanding of how the command you are using deals with no variation, missing values, etc. If you want your work to be reproducible, this is good practice even if you understand how the command works, as it makes your code more readable for other people who might not understand the command as well.

    I would suggest that you create a variable called something like sample that you set to 1 for all observations that you want to include in the regression and then restrict xtreg to that sample. In your case it would be something like this:

    *Create a variable that is 1 for the intended sample
    bysort id: egen sample = max(dvar > 0 & !missing(dvar)) // 1 in every year for an id with at least one positive, non-missing dvar, otherwise 0
    
    *Run regression restricted to sample
    xtreg dvar ivar1 ivar2 ivarx if sample == 1, fe vce(cluster id)
    

    You might have to edit the line that generates sample depending on the format of the data in dvar.

    Finally, xtreg could drop observations for other reasons, so make sure that the N in the regression output matches the number of 1s you get in tab sample, m, and that this also matches your expectation of the sample you have in mind.
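
One way to run that last check is to compare the flag against e(sample), which marks the observations the most recent estimation command actually used. The sketch below uses invented data and a single regressor, and it assumes the goal is a person-level flag (keep every year of anyone with at least one positive dvar). The helper variable used is hypothetical, and vce(cluster id) is omitted because the toy panel has too few clusters.

```stata
* Toy panel (invented values); person 2 never has a positive dvar,
* and person 3 has one year with missing dvar
clear
input id year dvar ivar1
1 2002 0 1
1 2003 10 2
1 2004 20 4
2 2002 0 3
2 2003 0 1
2 2004 0 2
3 2002 5 2
3 2003 . 5
3 2004 15 3
end
xtset id year

* Person-level flag: 1 if the person has any positive, non-missing dvar
bysort id: egen sample = max(dvar > 0 & !missing(dvar))

xtreg dvar ivar1 if sample == 1, fe

* Mark the observations xtreg actually used and compare with the flag
generate byte used = e(sample)
tabulate sample used, missing

* Observations that were intended but dropped (here: the missing dvar row)
count if sample == 1 & !used
```

If the final count is not zero, list those rows to see why they were dropped; in this toy panel it is the person-year with missing dvar.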