Search code examples
rdcast

How to "fill" a variable with dcast


As part of a study of car-sharing, I'm using tables which look like this one :

enter image description here

The problem here is that there shouldn't be two lines : this is actually one single journey (same driver, same hour of departure) and "C" should be on the first row, in the Passenger2 column.

Hence, I'm trying to automatically "fill" the Passenger2 column when Date/Hour_dep/Driver are identical. ID_Dep is a different matter, it is automatically generated for each row, CDE002 simply should not exist.

From what I've read, dcast is my best way out of this. But I... can't figure out how to use it properly. I'm not even sure dcast is what I should be using, actually. At some point with my real table, I managed to get a contingency table with one column for every possible value of Passenger2. Had I been able to extract the first non-NA value starting from the leftmost column, that would have given me the table I wanted. But I didn't manage to get it, either.

Here's the code that recreates the table and my totally-not-working dcast, any help gladly welcome.

test_iddep<-c("AAA1","BBB2")
test_Date<-c("01/05/2019","01/05/2019")
test_hourdep <- c("8:00","8:00")
test_driv<-c("A","A")
test_pass1<-c("B","C")
test_pass2<-c(NA,NA)

test_table <- data.frame(test_iddep,test_Date,test_hourdep,test_driv,test_pass1,test_pass2)

table_arranged <- dcast(test_table, test_driv + test_Date + test_hourdep + test_pass1 ~ test_pass2, 
                        margins=c("test_driv","test_Date","test_hourdep")) 

Solution

  • I suspect there are a number of different approaches you can use (and probably better).

    If you did want to use dcast, here might be one approach:

    First, I would melt the data based on date, hour, and driver. Then would create an id for each of these unique combinations. Then for each id, would add a sequential number for each passenger. Then dcast:

    library(reshape2)
    
    test_table <- data.frame(
      test_iddep = c("AAA1", "BBB2", "CCC3", "DDD4", "EEE5"),
      test_Date = c("01/05/2019", "01/05/2019", "01/07/2019", "01/07/2019", "01/07/2019"),
      test_hourdep = c("8:00", "8:00", "10:00", "10:00", "10:00"),
      test_driv = c("A", "A", "B", "B", "B"),
      test_pass1 = c("B", "C", "D", "E", "F")
    )
    
    x <- melt(test_table[-1], id.vars = c("test_driv", "test_Date", "test_hourdep"))
    x$id <- cumsum(!duplicated(x[1:3]))
    x$time <- ave(x$id, x$id, FUN=seq_along)
    dcast(x, test_driv + test_Date + test_hourdep ~ time, value.var = "value")
    
      test_driv  test_Date test_hourdep 1 2    3
    1         A 01/05/2019         8:00 B C <NA>
    2         B 01/07/2019        10:00 D E    F
    

    Please let me know if this is what you had in mind for output. I had removed test_iddep as it sounded like this was not needed from your question.