Search code examples
rdata.tablereshape2

Complex Reshaping of Data Frame, tracking record edits


I have a data frame that tracks whenever a project's record was edited, the time of the edit, what field was edited, the old value, and the new value. Columns OrderDate, Probability, and Total display what the value of those fields are today:

df.raw <- data.frame(project=rep(c('A','B'), each=4),
               createDate=as.Date(rep(c('2015-01-01','2017-05-01'), each=4)),
               editDate=as.Date(c('2018-06-01','2019-04-01','2019-05-01','2019-06-01', '2018-10-01','2018-11-01','2018-11-15','2019-01-01')), 
               editField=c('OrderDate', 'OrderDate','Probability','Probability', 'Total','Total', 'Probability','Total'),
               oldValue=c('2018-06-01','2019-05-01',20,30,500,550,70,400),
               newValue=c('2019-05-01','2019-06-01',30,50,550,400,30,450),
               OrderDate=as.Date(rep(c('2019-06-01','2019-01-01'), each=4)),
               Probability=rep(c(50,70), each=4),
               Total=rep(c(10,450), each=4))

  project createDate   editDate   editField   oldValue   newValue  OrderDate Probability Total
1       A 2015-01-01 2018-06-01   OrderDate 2018-06-01 2019-05-01 2019-06-01          50    10
2       A 2015-01-01 2019-04-01   OrderDate 2019-05-01 2019-06-01 2019-06-01          50    10
3       A 2015-01-01 2019-05-01 Probability         20         30 2019-06-01          50    10
4       A 2015-01-01 2019-06-01 Probability         30         50 2019-06-01          50    10
5       B 2017-05-01 2018-10-01       Total        500        550 2019-01-01          70   450
6       B 2017-05-01 2018-11-01       Total        550        400 2019-01-01          70   450
7       B 2017-05-01 2018-11-15 Probability         70         30 2019-01-01          70   450
8       B 2017-05-01 2019-01-01       Total        400        450 2019-01-01          70   450

I would like to transform this data frame so that:

  • There's an additional row for project creation.
  • Each row displays what the project's OrderDate, Probability, and Total was at the time of the creation or edit.
  • If a field was never edited, the field will always be equal to the final OrderDate,Probability, or Total value for that project.

The end result would look something like this:

df.reshaped <- data.frame(project=rep(c('A','B'), each=5),
           editDate=as.Date(c('2015-01-01','2018-06-01','2019-04-01','2019-05-01','2019-06-01', '2017-05-01', '2018-10-01','2018-11-01','2018-11-15','2019-01-01')),
           editField=c('Created','OrderDate', 'OrderDate','Probability','Probability','Created', 'Total','Total', 'Probability','Total'),
           OrderDateAtEdit=as.Date(c('2018-06-01','2019-05-01','2019-06-01','2019-06-01','2019-06-01',rep('2019-01-01', 5))),
           ProbabilityAtEdit=c(20,20,20,30,50,70,70,70,30,30),
           TotalAtEdit=c(10,10,10,10,10,500,550,400,400,450))
   project   editDate   editField OrderDateAtEdit ProbabilityAtEdit TotalAtEdit
1        A 2015-01-01     Created      2018-06-01                20          10
2        A 2018-06-01   OrderDate      2019-05-01                20          10
3        A 2019-04-01   OrderDate      2019-06-01                20          10
4        A 2019-05-01 Probability      2019-06-01                30          10
5        A 2019-06-01 Probability      2019-06-01                50          10
6        B 2017-05-01     Created      2019-01-01                70         500
7        B 2018-10-01       Total      2019-01-01                70         550
8        B 2018-11-01       Total      2019-01-01                70         400
9        B 2018-11-15 Probability      2019-01-01                30         400
10       B 2019-01-01       Total      2019-01-01                30         450

I'm not sure where to begin with this, any help would be greatly appreciated! Thanks.


Solution

  • I think the data has been amalgamated together, you need to split them into an events table and an edits table:

    library(data.table)
    setDT(df.raw)
    
    #created the events table with the available values first
    cols <- c("OrderDate", "Probability", "Total")
    events <- df.raw[, setnames(rbindlist(.(.(createDate[1L], "Created"), 
        .(editDate, editField))), c("editDate","editField")), project]
    events[unique(df.raw, by=c("project", "Probability", "Total")), on=.(project), 
        paste0(cols, "AtEdit") := lapply(mget(cols), as.character)]
    
    #historical edits in another table
    edits <- df.raw[, .(startDate=c(createDate[1L], editDate), 
        endDate=c(editDate, as.Date("9999-12-31")),
        value=c(oldValue, newValue[.N])), .(project, editField)]
    
    #perform non-equi joins to update events table
    for (x in cols) {
        cn <- paste0(x, "AtEdit")
        v <- edits[editField==x][events, on=.(project, startDate<=editDate, endDate>editDate), value] 
        events[, (cn) := fifelse(is.na(v), get(cn), as.character(v))]  
    }
    

    output:

        project   editDate   editField OrderDateAtEdit ProbabilityAtEdit TotalAtEdit
     1:       A 2015-01-01     Created      2018-06-01                20          10
     2:       A 2018-06-01   OrderDate      2019-05-01                20          10
     3:       A 2019-04-01   OrderDate      2019-06-01                20          10
     4:       A 2019-05-01 Probability      2019-06-01                30          10
     5:       A 2019-06-01 Probability      2019-06-01                50          10
     6:       B 2017-05-01     Created      2019-01-01                70         500
     7:       B 2018-10-01       Total      2019-01-01                70         550
     8:       B 2018-11-01       Total      2019-01-01                70         400
     9:       B 2018-11-15 Probability      2019-01-01                30         400
    10:       B 2019-01-01       Total      2019-01-01                30         450