Search code examples
rdataframerep

R : Repeat rows in a data frame with new set of values for certain columns


employee <- c("John", "Adi", "Sam")
salary <- c(21000, 22000, 23000)
startdate <- as.Date(c("2014-11-01","2014-01-01","2014-10-01"))
enddate <- as.Date(c("2015-10-31","2014-12-31","2015-10-31"))
N<- c(2,1,2)
df<- data.frame(employee,salary, startdate, enddate, N)

I want to repeat the entire row by "n" number of times where "n" is specified in the column N but I want to change the enddate in the original row to be fixed date such as "31/12/2014" and make this fixed date as startdate in the repeating row. Run the code to see the resultant example (expected output)in df2:

employee <- c(rep("John",2), "Adi", rep("Sam",2))
salary <- c(21000,21000, 22000, 23000,23000)
startdate <- as.Date(c("2014-11-01","2014-12-31", "2014-01-01","2014-10-01","2014-12-31"))
enddate <- as.Date(c("2014-12-31","2015-10-31","2014-12-31","2014-12-31","2015-10-31"))
N<- c(2,2,1,2,2)
df2<- data.frame(employee,salary, startdate, enddate, N)

Solution

  • We could do this using data.table. We convert the 'data.frame' to 'data.table' (setDT(df)), expand the rows by replicating the 'N' variable. We get the numeric index ('i1') of the observation (.I[1L]) grouped by 'employee', use that to assign (:=) the 'enddate' with '2014-12-31'. Similarly, we get the row index ('i2') of 2nd to last element (.I[seq_len(.N)>1L]) for each 'employee' and set the 'startdate' as '2014-12-31'.

    DT <- setDT(df)[rep(seq_len(.N), N)]
    i1 <- DT[,  .I[1L] , by = employee]$V1
    DT[i1, enddate:= as.Date('2014-12-31')]
    i2 <- DT[, .I[seq_len(.N)>1L], employee]$V1
    DT[i2, startdate:= as.Date('2014-12-31')]
    identical(as.data.table(df2), DT)
    #[1] TRUE
    

    Or we can do this using if and concatenate the '2014-12-31' for the 'startdate' and 'enddate', grouped by 'employee' and then assign the output back to the columns 'startdate', 'enddate'.

    DT[, c('startdate', 'enddate') := if(.N>1L) 
                list(c(startdate[1L], as.Date('2014-12-31')),
                     c(as.Date('2014-12-31'), enddate[.N])) , by = employee]
    identical(DT, as.data.table(df2))
    #[1] TRUE