Search code examples
rperformancedataframetranspose

Using which function to transpose parts of columns under condition


Suppose we have the following data:

X Y
  6
  1
  2
  2
1 1
  8
  3
  4
  1
1 2

I want to convert it to:

X Y Y-1 Y-2 Y-3
  6
  1
  2
  2
1 1  2   2   1
  8
  3
  4
  1
1 2  1   4   3

That is: for rows with X=1 - take 3 previous Y values and append them to this row.

I "brute-forced" it with a loop:

namevector <- c("Y-1", "Y-2", "Y-3")
mydata[ , namevector] <- ""

for(i in 1:nrow(mydata)){
  if(mydata$X[i] != ""){mydata[i,3:5] <- mydata$Y[(i-1):(i-3)]}
}

But it was too slow for my dataset of ~300k points - about 10 minutes.

Then I found a post with a similar question, and they proposed which function, which reduced the time to tolerable 1-2 minutes:

namevector <- c("Y-1", "Y-2", "Y-3")
mydata[ , namevector] <- ""
trials_rows <- which(mydata$X != "")

for (i in trials_rows) {mydata[i,3:5] <- mydata$Y[(i-1):(i-3)]}

But considering that which takes less than a second - I believe I can somehow combine which with some kind of transpose function, but I can't get my mind around it.

I have a big data frame (~300k rows), and ~6k rows have this "X" value.

Is there a fast and simple way to do it fast, instead of iterating through the results of which function?


Solution

  • You can do this with a single assignment using some vectorised trickery:

    mydata[trials_rows, namevector] <- mydata$Y[trials_rows - rep(1:3,each=length(trials_rows))]
    
    mydata
    #    X Y Y-1 Y-2 Y-3
    #1  NA 6            
    #2  NA 1            
    #3  NA 2            
    #4  NA 2            
    #5   1 1   2   2   1
    #6  NA 8            
    #7  NA 3            
    #8  NA 4            
    #9  NA 1            
    #10  1 2   1   4   3
    

    Basically, take each row in trials_rows, look backwards three rows using a vectorised subtraction, and then overwrite the combination of trials_rows in rows and namevector in columns.

    Reproducible example used here:

    mydata <- structure(list(X = c(NA, NA, NA, NA, 1L, NA, NA, NA, NA, 1L), 
        Y = c(6L, 1L, 2L, 2L, 1L, 8L, 3L, 4L, 1L, 2L)), class = "data.frame", row.names = c(NA, 
    -10L))