Search code examples
rdataframedatecalculated-columnstibble

How to make new column copy values from another column AND row, based on one column in R?


Here is some example data:

data <- tibble(line_number = seq(1:5),
               test = c("testA", "testB", "testC", "testD", "testE"),
               start_date = as_date(c("2021-01-01", "2021-02-01", "2021-02-15", "2021-03-20", "2021-04-12")),
               finish_date = as_date(c("2021-01-01", "2021-03-01", "2021-02-18", "2021-05-20", "2021-04-12")),
               coded_date = c(NA, "1S", "2F", "2S", "4F"))

#       line_number test  start_date finish_date coded_date
#             <int> <chr> <date>     <date>      <chr>     
# 1               1 testA 2021-01-01 2021-01-01  NA        
# 2               2 testB 2021-02-01 2021-03-01  1S        
# 3               3 testC 2021-02-15 2021-02-18  2F        
# 4               4 testD 2021-03-20 2021-05-20  2S        
# 5               5 testE 2021-04-12 2021-04-12  4F        

I would like to create two new columns called "new_start_date" and "new_finish_date" where its contents are determined by the "coded_date" column.

For the coded_date column: the number corresponds to the line number, "S" means start date and "F" means finish date.

Taking row 2 with 1S as an example, I would like the "new_start_date" column to take the value of the start date in row 1 and copy it, leaving the "new_finish_date" with NA.

Taking row 3 with 2F as an example, I would like the "new_finish_date" column to take the value of the finish date in row 2 and copy it, leaving the "new_start_date" with NA.

Here is my desired output:

#       line_number test  start_date finish_date coded_date new_start_date new_finish_date
#             <int> <chr> <date>     <date>      <chr>      <date>         <date>
# 1               1 testA 2021-01-01 2021-01-01  NA         NA             NA
# 2               2 testB 2021-02-01 2021-03-01  1S         2021-01-01     NA
# 3               3 testC 2021-02-15 2021-02-18  2F         NA             2021-03-01
# 4               4 testD 2021-03-20 2021-05-20  2S         2021-02-01     NA
# 5               5 testE 2021-04-12 2021-04-12  4F         NA             2021-05-20

I'm still a beginner using R so any help or input would be much appreciated :)


Solution

  • My first instinct was to use two for-loops

    # deal with start date first
    # get all values in 'coded_date' that contain an 'S'
    svals <- grep(pattern = "S", x = data$coded_date, value = TRUE)
    # we'll go row by row
    # for each case (row) in that contains an 'S'
    for(sval in svals){
      # get the rowid from the value of 'coded_date'
      # this is the row where we'll get the new date
      rowid <- substring(text = sval, first = 1, last = 1)
      # assign a 'new_start_date' to the row where we found sval
      # the row containing this new value is defined by rowid
      # use '%in% rather than '==' on left side because NAs are present
      data[data$coded_date %in% sval, "new_start_date"] <- data[rowid,"start_date"]
    }
    
    ## repeat for finish date
    # S and F loops could be nested together!
    fvals <- grep(pattern = "F", x = data$coded_date, value = TRUE)
    for(fval in fvals){
      rowid <- substring(text = fval, first = 1, last = 1)
      data[data$coded_date %in% fval, "new_finish_date"] <- data[rowid,"finish_date"]
    

    e: Here's a vectorized version that appears to work. If someone is keen, I'd bet there's room for improvement here, and I'd appreciate any feedback!

    data$new_start_date2 <- ifelse(
      test = grepl(pattern = "S", x = data$coded_date),
      yes = data[sub(pattern = "S", replacement = "", data$coded_date),"start_date"],
      no = NA)
    data$new_finish_date2 <- ifelse(
      test = grepl(pattern = "F", x = data$coded_date),
      yes = data[sub(pattern = "F", replacement = "", data$coded_date),"finish_date"],
      no = NA)