I'm using a longitudinal survey in long format, and I'm trying to create a dummy variable for if an individual has NOT got a college degree by the age of 25. My data looks something like this:
ID CYRB VAR VALUE
1 1983 DEG98 1
1 1983 DEG00 1
1 1983 DEG02 1
1 1983 DEG04 0
2 1979 DEG08 0
2 1979 DEG00 0
2 1979 DEG02 1
2 1979 DEG04 1
3 1978 DEG98 NA
3 1978 DEG00 NA
3 1978 DEG02 NA
3 1978 DEG04 0
As I've tried to illustrate, there are quite a few missing data points for survey responses in the relevant years. But clearly if the respondent responds no in later years it can be inferred that they didn't have a degree when they were <25 either.
Trying to be as general as possible, how can I create a new variable that depends on all the variable values of just one individual, i.e. for ID = 1, 2, 3 etc.?
Sorry if I'm not clear!
Edit:
Sorry my fault, the data used to be in wide format and the variables denote whether the respondent has a college degree in 1998, 2000, 2002 etc. (with value denoting the response 1 == TRUE, 0 == FALSE), CYRB is indeed year of birth, the table edited for the expected output of my desired dummy variable would be:
ID CYRB VAR VALUE DUMMY
1 1983 DEG98 0 0
1 1983 DEG00 0 0
1 1983 DEG02 0 0
1 1983 DEG04 1 0
2 1979 DEG08 0 0
2 1979 DEG00 0 0
2 1979 DEG02 1 0
2 1979 DEG04 1 0
3 1978 DEG98 NA 1
3 1978 DEG00 NA 1
3 1978 DEG02 NA 1
3 1978 DEG04 0 1
i.e. if the respondent replies in any survey from the age of 25 onwards that he/she does not have a college degree the dummy takes the value of 1.
Hope this is a bit clearer.
Assuming you meant "DEG98" in the first row for ID 2:
First, recover the respondent's age:
d$survey_year <- as.numeric(sapply(d$VAR, substring, 4, 5))
d$survey_year <- ifelse(d$survey_year<20, 2000+d$survey_year, 1900+d$survey_year)
d$age <- d$survey_year - d$CYRB
Use the any()
function to test your criteria:
degree <- data.frame(DUMMY=c(
by(d, d$ID, function(x) any(x$VALUE==0 & x$age>25))))
degree$ID <- rownames(degree)
Combine the dummy values with the original dataframe:
out <- merge(d[,c("ID", "CYRB", "VAR", "VALUE")], degree, all.x=TRUE)
Output:
> out
ID CYRB VAR VALUE DUMMY
1 1 1983 DEG98 0 FALSE
2 1 1983 DEG00 0 FALSE
3 1 1983 DEG02 0 FALSE
4 1 1983 DEG04 1 FALSE
5 2 1979 DEG98 0 FALSE
6 2 1979 DEG00 0 FALSE
7 2 1979 DEG02 1 FALSE
8 2 1979 DEG04 1 FALSE
9 3 1978 DEG98 NA TRUE
10 3 1978 DEG00 NA TRUE
11 3 1978 DEG02 NA TRUE
12 3 1978 DEG04 0 TRUE
EDIT: A more parsimonious solution using the dplyr
package. First, write a getYear()
function to convert DEGxx
to the actual year:
getYear <- function(x) {
x <- as.numeric(substring(x, 4, 5))
ifelse(x<16, 2000+x, 1900+x)
}
Then transform the dataset:
library(dplyr)
d %>% group_by(ID) %>%
mutate(survey_year=getYear(VAR),
age=survey_year - CYRB,
DUMMY=any(VALUE==0 & age>25))
Output:
Source: local data frame [12 x 7]
Groups: ID [3]
ID CYRB VAR VALUE DUMMY survey_year age
(int) (int) (fctr) (int) (lgl) (dbl) (dbl)
1 1 1983 DEG98 0 FALSE 1998 15
2 1 1983 DEG00 0 FALSE 2000 17
3 1 1983 DEG02 0 FALSE 2002 19
4 1 1983 DEG04 1 FALSE 2004 21
5 2 1979 DEG98 0 FALSE 1998 19
6 2 1979 DEG00 0 FALSE 2000 21
7 2 1979 DEG02 1 FALSE 2002 23
8 2 1979 DEG04 1 FALSE 2004 25
9 3 1978 DEG98 NA TRUE 1998 20
10 3 1978 DEG00 NA TRUE 2000 22
11 3 1978 DEG02 NA TRUE 2002 24
12 3 1978 DEG04 0 TRUE 2004 26