I need to transform a data frame containing population information for each sampling date into a data frame with individual information to run a survival analysis. My data look like this:
library(data.table)  # for data.table()
library(dplyr)       # for %>%, group_by(), slice() and summarise() used further down
Place=c(rep("Europe",6))
Age=c(rep("Newborn",3),rep("Young",3))
Date_sample=as.Date(c('2014-03-18','2014-10-01','2015-01-15','2014-06-16','2014-12-21','2015-01-15'))
Number_indiv_status1=c(0,2,1,0,2,2)
Number_indiv_status2=c(10,8,7,7,5,3)
df<-data.table(Place,Age,Date_sample,Number_indiv_status1,Number_indiv_status2)
> df
Place Age Date_sample Number_indiv_status1 Number_indiv_status2
1: Europe Newborn 2014-03-18 0 10
2: Europe Newborn 2014-10-01 2 8
3: Europe Newborn 2015-01-15 1 7
4: Europe Young 2014-06-16 0 7
5: Europe Young 2014-12-21 2 5
6: Europe Young 2015-01-15 2 3
And I need to obtain this:
> new_df
Place Age Date_sample Number_indiv_status1 Number_indiv_status2 Status date_event
1: Europe Newborn 2014-10-01 2 8 1 2014-05-30
2: Europe Newborn 2014-10-01 2 8 1 2014-08-15
3: Europe Newborn 2015-01-15 1 7 1 2014-12-17
4: Europe Newborn 2015-01-15 1 7 2 2015-01-15
5: Europe Newborn 2015-01-15 1 7 2 2015-01-15
6: Europe Newborn 2015-01-15 1 7 2 2015-01-15
7: Europe Newborn 2015-01-15 1 7 2 2015-01-15
8: Europe Newborn 2015-01-15 1 7 2 2015-01-15
9: Europe Newborn 2015-01-15 1 7 2 2015-01-15
10: Europe Newborn 2015-01-15 1 7 2 2015-01-15
11: Europe Young 2014-12-21 2 5 1 2014-09-01
12: Europe Young 2014-12-21 2 5 1 2014-09-21
13: Europe Young 2015-01-15 2 3 1 2014-12-29
14: Europe Young 2015-01-15 2 3 1 2015-01-02
15: Europe Young 2015-01-15 2 3 2 2015-01-15
16: Europe Young 2015-01-15 2 3 2 2015-01-15
17: Europe Young 2015-01-15 2 3 2 2015-01-15
I wrote the following code, which does not work:
tot_lines <- df %>% group_by(Age) %>% slice(1) %>% ungroup() %>% summarise(tot_lines=sum(Number_indiv_status2))
new_df <- data.frame(matrix(NA, nrow = tot_lines[[1]], ncol = 7))
colnames(new_df)=c(colnames(df),"Status","date_event")
k=0
for (i in 1:nrow(df)) {
if(df[i,"Number_indiv_status1"]>0){
for (j in 1:df[[i,"Number_indiv_status1"]]){
new_df[k+j,c(1:5)]=df[i,c(1:5)]
new_df[k+j,6]=1
new_df[k+j,7]=sample(seq.POSIXt(as.POSIXct(df[[i-1,3]]), as.POSIXct(df[[i,3]]),by="day"), size = 1) #random date between df[i,3] and df[i+1,3]
k=sum(complete.cases(new_df))
}
} else {
}
if(i==sum(df$Age=="Newborn")) {
for (l in 1:df[i,"Number_indiv_status2"]) {
new_df[k+l,c(1:5)]=df[l,c(1:5)]
new_df[k+l,6]=2
new_df[k+l,7]=df[i,3]
} else {
}
}
k=sum(complete.cases(new_df))
}
I have identified several errors/tasks in the loop that I need to solve but cannot figure out:
1. There is a Date issue here:
new_df[2,c(1:5)]=df[2,c(1:5)]
I don't understand it, as class(df$Date_sample) returns "Date" (cf. this post). I have tried to use new_df[1,3]=ymd(df[[2,3]]) or new_df[1,3]=as_date(df[[2,3]]), as mentioned here, without success. I still get 16344 instead of "2014-10-01" (it is the matching integer, but not in date format). Why, and how can I solve this?
2. I tried assigning a random date in the time interval following this, which does not work here:
new_df[1,7]=sample(seq.POSIXt(as.POSIXct(df[[1,3]]), as.POSIXct(df[[2,3]]),by="day"), size = 1)
I believe it is a matter of format, because it returns "1409443200" and as_date(1409443200) is not relevant ("3860894-05-31"). I also read this and this, but I would like to avoid creating a function in or before the loop. I also checked the lubridate package to find an elegant option, but could not figure it out. If anyone has an idea about that option, it would be great.
3. As my loop does not work, I am not sure my indexes (i, j, k and l) are coded correctly and placed in the right place.
4. Once the loop works: is there a way to insert it in a pipe (%>%), for example? I actually have more than one Place and more than two Age classes, so I would need to group_by Place and Age, but append everything to a single new data frame new_df.
5. Would there be a non-loop option to do the same, with the tidyverse for example? I try to avoid loops, but here I don't see how to manage it.
6. Last but not least: still new on the site, should I have asked my questions in separate posts?
Edit
I found a solution for point 1: setting
new_df$Date_sample <- as.Date(new_df$Date_sample)
before k=0 and entering the loop solves the format issue for new_df. I still don't know why using ymd() or as_date() in the loop does not work, though.
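A minimal sketch of what may be behind point 1, assuming the cause is the column type of new_df rather than ymd() or as_date(): matrix(NA, ...) produces logical columns, and writing a Date into a non-Date vector keeps only the underlying number of days. The names demo, fixed and d are hypothetical and only serve the illustration, and the assignment is done at vector level (demo$Date_sample[1]) rather than with new_df[i, j].

```r
# Hypothetical one-column frame built the same way as new_df:
# matrix(NA, ...) yields logical columns.
demo <- data.frame(matrix(NA, nrow = 2, ncol = 1))
colnames(demo) <- "Date_sample"

d <- as.Date("2014-10-01")

# Writing a Date into a logical vector drops the Date class;
# only the number of days since 1970-01-01 is stored.
demo$Date_sample[1] <- d
class(demo$Date_sample)   # "numeric"
demo$Date_sample[1]       # 16344

# If the column is already a Date (here built from NA), the class is kept.
fixed <- data.frame(Date_sample = as.Date(rep(NA, 2)))
fixed$Date_sample[1] <- d
class(fixed$Date_sample)  # "Date"
```

If this is indeed the mechanism, ymd() and as_date() cannot help inside the loop: they already return a Date, and the class is lost at assignment, not at conversion.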
I also found a way to assign a random date in the interval between two sampling times (point 2). I based my code on the Python suggestion here (first answer) to get to this:
sample(unclass(as.Date(df[[i,3]]))-unclass(as.Date(df[[i-1,3]])),1)+df[[i-1,3]]
It also requires setting
new_df$date_event <- as.Date(new_df$date_event)
before k=0 and the loop; otherwise, as before, the result is right but not in date format.
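The same idea can be sketched outside the loop, staying in the Date class from start to end. This is only an illustration under my own assumptions: start, end, n_days and random_dates are hypothetical names, the two dates are the first two Newborn sampling dates, and set.seed() is there only to make the draw repeatable.

```r
set.seed(1)  # only so that the sketch gives the same draw each time

start <- as.Date("2014-03-18")  # previous sampling date
end   <- as.Date("2014-10-01")  # current sampling date

# Number of whole days between the two sampling dates
n_days <- as.integer(end - start)

# Draw day offsets in 1..n_days and add them to the start date.
# Date + integer is still a Date, so no conversion back is needed.
random_dates <- start + sample.int(n_days, size = 2, replace = TRUE)
random_dates
```

Drawn this way, each date falls after the previous sampling date and no later than the current one, which matches what my sample(...) line does for a single draw.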
I keep working on the other errors; they are still unsolved.
I got the loop to work, which solves points 1-3.
In the data frame, I needed to encode Age as a factor:
Age=as_factor(c(rep("Newborn",3),rep("Young",3)))
Then, this does the job:
k=0         # number of rows already filled in new_df
Age_fact=1  # index of the Age level currently being processed
for (i in 1:nrow(df)) {
  # one row per status-1 individual, with a random date since the previous sampling
  if(df[i,"Number_indiv_status1"]>0){
    for (j in 1:df[[i,"Number_indiv_status1"]]){
      new_df[k+j,c(1:5)]=df[i,c(1:5)]
      new_df[k+j,6]=1
      new_df[k+j,7]=sample(unclass(as.Date(df[[i,3]]))-unclass(as.Date(df[[i-1,3]])),1)+df[[i-1,3]]
    }
    k=sum(complete.cases(new_df))
  }
  # last sampling date of the current Age level: one row per status-2 individual
  if(i==tail(which(df$Age == levels(df$Age)[Age_fact]),1)) {
    for (l in 1:df[[i,"Number_indiv_status2"]]) {
      new_df[k+l,c(1:5)]=df[i,c(1:5)]
      new_df[k+l,6]=2
      new_df[k+l,7]=df[i,3]
    }
    k=sum(complete.cases(new_df))
  }
  # move on to the next Age level
  if (i==tail(which(df$Age == levels(df$Age)[Age_fact]),1)) {
    Age_fact=Age_fact+1
  }
  k=sum(complete.cases(new_df))
}
One limitation though: Age now appears as the factor index (1 or 2) in new_df, instead of the name of the level. Setting
new_df$Age <- as.factor(new_df$Age)
before the loop does not solve it. I can still change it later, but as my data set is much larger than this, it would be great to get Age copied as a factor directly.
I still have this question: is there a way to do this without a loop, with the tidyverse for example?
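To make the question concrete, here is a rough base-R sketch (not tidyverse) of the kind of loop-free result I am after, built by repeating row indices with rep(). It rests on assumptions of mine: the first sampling date of each Place/Age group has no status-1 individuals (otherwise there is no previous date to draw from), and dat, grp, prev_date, is_last, events and censored are hypothetical names. The random dates will not match the ones in the expected output above.

```r
set.seed(42)  # only so that the sketch gives the same draw each time

dat <- data.frame(
  Place = rep("Europe", 6),
  Age = c(rep("Newborn", 3), rep("Young", 3)),
  Date_sample = as.Date(c("2014-03-18", "2014-10-01", "2015-01-15",
                          "2014-06-16", "2014-12-21", "2015-01-15")),
  Number_indiv_status1 = c(0, 2, 1, 0, 2, 2),
  Number_indiv_status2 = c(10, 8, 7, 7, 5, 3)
)

# Sort so that "previous row" means "previous sampling date" within a group
dat <- dat[order(dat$Place, dat$Age, dat$Date_sample), ]
grp <- interaction(dat$Place, dat$Age, drop = TRUE)

# Previous sampling date within each group, as day numbers (NA for the first)
prev_date <- ave(as.numeric(dat$Date_sample), grp,
                 FUN = function(x) c(NA, head(x, -1)))
# TRUE for the last sampling date of each group
is_last <- !duplicated(grp, fromLast = TRUE)

# Status 1: repeat each row once per status-1 individual
ev_rows <- rep(seq_len(nrow(dat)), dat$Number_indiv_status1)
events <- dat[ev_rows, ]
gap <- as.numeric(events$Date_sample) - prev_date[ev_rows]
events$Status <- 1
# Random offset in 1..gap days after the previous sampling date
events$date_event <- as.Date(prev_date[ev_rows] +
                               ceiling(runif(length(gap), 0, gap)),
                             origin = "1970-01-01")

# Status 2: repeat the last row of each group once per status-2 individual
cens_rows <- rep(which(is_last), dat$Number_indiv_status2[is_last])
censored <- dat[cens_rows, ]
censored$Status <- 2
censored$date_event <- censored$Date_sample

new_df <- rbind(events, censored)
new_df <- new_df[order(new_df$Place, new_df$Age,
                       new_df$Status, new_df$date_event), ]
rownames(new_df) <- NULL
```

Because whole rows are copied rather than assigned cell by cell, Age stays as text and the date columns stay in the Date class. What I would like to know is whether the tidyverse offers a cleaner way to write the same thing.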