Search code examples
rdata.tablecolsubject

picking first date for each individual R


This question is related to this post give each id the same column value R butthe solution there is not working for me.

I have a datatable where index is the first date that an individual got a drug C10.*? in the period between 2010-04-01 and 2010-09-30:

names   drugs      dates      index
1:  mary C10AA07 2009-10-01         NA
2:  mary C09AA03 2010-06-01         NA
3:  mary C10AA07 2010-07-01 2010-07-01
4:  mary A02BC01 2010-07-01         NA
5:  mary C10AA07 2010-07-24 2010-07-01
6:   tom C10AA05 2009-12-01         NA
7:   tom C10AA05 2010-04-06 2010-04-06
8:   tom C07AB03 2010-05-12         NA
9:   tom C10AA05 2010-08-01 2010-04-06

I am trying to give mary her index date for all rows in the column 'index'. and similarly for tom. so that the ouput is like this:

 names   drugs      dates      index
1:  mary C10AA07 2009-10-01 2010-07-01
2:  mary C09AA03 2010-06-01 2010-07-01
3:  mary C10AA07 2010-07-01 2010-07-01
4:  mary A02BC01 2010-07-01 2010-07-01
5:  mary C10AA07 2010-07-24 2010-07-01
6:   tom C10AA05 2009-12-01 2010-04-06
7:   tom C10AA05 2010-04-06 2010-04-06
8:   tom C07AB03 2010-05-12 2010-04-06
9:   tom C10AA05 2010-08-01 2010-04-06

this is exactly what the problem is in the link posted above. These are the lines of code I have tried, but each code gives me pack just all NA values for index or else doesnt change my dt2

Attempt 1:

dt2[, index := index[grepl('^C10.*?', as.character(dt2$drugs))& dt2$dates>="2010-04-01" & dt2$dates<"2010-10-01"][1], by = names]     
dt2

attempt 2:

dt2[, index := index[grepl('^C10.*?', as.character(dt2$drugs))[1], by = names])
dt2

I can't understand whats happening and why the code won't work. If anyone can shed any light on this that would be great. Thank you.

attempt 3:

dt2[, index := index[drugs == 'C10AA05' & drugs=='C10AA07'][1], by = names]
dt2

Solution

  • By what you are describing, if your index is not yet filled, use this:

    dt2[, index := min(dates[grepl("^C10", drugs)], na.rm=TRUE), by=names]
    

    If your index already has the correct value, and you are simply trying to fill the NA's, use the following instead, as it will be faster

    > dt2[, index := index[!is.na(index)][[1]], by=names]
    > dt2
       names   drugs      dates      index
    1:  mary C10AA07 2009-10-01 2010-07-01
    2:  mary C09AA03 2010-06-01 2010-07-01
    3:  mary C10AA07 2010-07-01 2010-07-01
    4:  mary A02BC01 2010-07-01 2010-07-01
    5:  mary C10AA07 2010-07-24 2010-07-01
    6:   tom C10AA05 2009-12-01 2010-04-06
    7:   tom C10AA05 2010-04-06 2010-04-06
    8:   tom C07AB03 2010-05-12 2010-04-06
    9:   tom C10AA05 2010-08-01 2010-04-06
    > 
    

    If you are going to be doing this often, I would recommend setting key to drugs or even creating a new column with the drugid. Note that you can use the key in .SD, so the following would work for you:

    dt2[, drugid := substr(drugs, 1, 3)]
    setkey(dt2, drugid)
    
    ## HAVE A LOOK AT THE OUTPUT
    dt2[, .SD[.("C10"), min(dates)]]
    dt2[, .SD[.("C10"), min(dates)], by=names]
    dt2[, .SD[.("C10"), min(dates)]$V1, by=names]
    dt2[, index := .SD[.("C10"), min(dates)]$V1, by=names]