I have a longitudinal dataset that records person's employment status monthly for 45 months. I would like to be able to create two variables to add to this dataset: 1) Overall duration each person spent "Unemployed" 2) Number of unemployment spells
Ideally it would also skip over NAs without interrupting the spell
I've created an example dataset to make things simple:
ID <- c(1:10, 1:10, 1:10)
date <- c("2006-09-01", "2006-09-01", "2006-09-01", "2006-09-01", "2006-09-01", "2006-09-01", "2006-09-01",
"2006-09-01", "2006-09-01", "2006-09-01", "2006-10-01", "2006-10-01", "2006-10-01", "2006-10-01",
"2006-10-01", "2006-10-01", "2006-10-01", "2006-10-01", "2006-10-01", "2006-10-01", "2006-11-01",
"2006-11-01", "2006-11-01", "2006-11-01", "2006-11-01", "2006-11-01", "2006-11-01", "2006-11-01",
"2006-11-01", "2006-11-01")
act <- c("Unemployed", "Employment", "Education", "Education", "Education", "Education", "Education",
"Education", "Education", "Unemployed", "Education", "Unemployed", "Unemployed", "Unemployed",
"Education", "Education", "Employment", "Education", "Education", "NA", "Unemployed",
"Unemployed", "NA", "Unemployed", "Education", "Employment", "Employment", "NA", "Education",
"Unemployed")
df <- data.frame(ID, date, act)
df[order(ID),]
ID date act
1 1 2006-09-01 Unemployed
11 1 2006-10-01 Education
21 1 2006-11-01 Unemployed
2 2 2006-09-01 Employment
12 2 2006-10-01 Unemployed
22 2 2006-11-01 Unemployed
3 3 2006-09-01 Education
13 3 2006-10-01 Unemployed
23 3 2006-11-01 NA
4 4 2006-09-01 Education
14 4 2006-10-01 Unemployed
24 4 2006-11-01 Unemployed
5 5 2006-09-01 Education
15 5 2006-10-01 Education
25 5 2006-11-01 Education
6 6 2006-09-01 Education
16 6 2006-10-01 Education
26 6 2006-11-01 Employment
7 7 2006-09-01 Education
17 7 2006-10-01 Employment
27 7 2006-11-01 Employment
8 8 2006-09-01 Education
18 8 2006-10-01 Education
28 8 2006-11-01 NA
9 9 2006-09-01 Education
19 9 2006-10-01 Education
29 9 2006-11-01 Education
10 10 2006-09-01 Unemployed
20 10 2006-10-01 NA
30 10 2006-11-01 Unemployed
I tried a solution proposed by Roland at Calculate duration in R but I am not sure how to adapt it to give me results by ID and deal with NAs.
library(data.table)
setDT(df)
df[, date := as.POSIXct(date, format = "%Y-%m-%d", tz = "GMT")]
glimpse(df)
df$act <- ifelse(df$act == "Unemployed",1,-1)
df[, run := cumsum(c(1, diff(act) != 0))]
df1 <- df[, list(act = unique(act),
duration = difftime(max(date), min(date), unit = "weeks")),
by = run]
df1
run act duration
1: 1 1 0 weeks
2: 2 -1 0 weeks
3: 3 1 0 weeks
4: 4 -1 0 weeks
5: 5 1 0 weeks
6: 6 -1 0 weeks
7: 7 1 0 weeks
8: 8 -1 0 weeks
9: 9 1 0 weeks
10: 10 -1 0 weeks
11: 11 1 0 weeks
What I am after is to achieve this (duration here is in months but can be weeks or days):
ID spell_count duration
1 1 2 2
2 2 1 2
3 3 1 1
...
10 10 1 2
Any help with this would be greatly appreciated, any links/literature/examples.
Thank you.
I am using only your first code block, then for Overall duration, I do:
library(data.table)
setDT(df)
df_duration = df[act=="Unemployed",.(duration = .N),by = ID]
the number of unemployment spells is a little bit trickier:
df_spell_count = df[order(ID,date)]
df_spell_count <- df_spell_count[!(is.na(act)|act=="NA")]
df_spell_count[,previous_act := shift(act,1),by = ID]
df_spell_count<-df_spell_count[act =="Unemployed" & (previous_act!="Unemployed" | is.na(previous_act))]
df_spell_count<-df_spell_count[,.(spell_count =.N),by = ID]
If you want to merge both things, just:
df_stats <- merge(df_duration,df_spell_count, by = "ID", all.x = TRUE,all.y = TRUE)
Observe that this df does not contain rows for those users without unemployment periods.