Tags: r, function, for-loop, data-management, large-data

Endless function/loop in R: Data Management


I am trying to restructure an enormous data frame (about 12,000 cases). In the old data frame, one person is one row with about 250 columns (e.g. Person 1, test A1, test A2, test B1, ...). I want all the results for each test item in one column: there are 10 sets overall (1-10), each with 24 items (A-Y), so one person ends up with 24 columns and 10 rows. There is also a fixed part before the items A-Y start (personal information like age, gender, etc.), which I want to keep as it is (fixdata). The function/loop works for 30 cases (I tried it in advance), but for the 12,000 it is still calculating, for nearly 24 hours now. Any ideas why?

restructure <- function(data, firstcol, numcol, numsets){
    # placeholder first row so that the rbind below has matching column names
    out <- data.frame(t(rep(0, (firstcol - 1) + numcol)))
    names(out) <- names(data)[1:(firstcol + numcol - 1)]
    for (i in 1:nrow(data)) {
        # fixed part: personal information (columns before the test items)
        fixdata <- data[i, 1:(firstcol - 1)]
        # walk over the start column of each set of items
        for (j in seq(firstcol, (firstcol - 1) + numcol * numsets, by = numcol)) {
            flexdata <- data[i, j:(j + numcol - 1)]
            tmp <- cbind(fixdata, flexdata)
            names(tmp) <- names(data)[1:(firstcol + numcol - 1)]
            out <- rbind(out, tmp)
        }
    }
    # drop the placeholder first row
    out <- out[2:nrow(out), ]
    return(out)
}
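
For illustration, a call would look like this (the column positions below are placeholder values, not the real layout of the data):

## e.g. 7 columns of personal information, then 10 sets of 24 items each
neu <- restructure(daten, firstcol = 8, numcol = 24, numsets = 10)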

Thanks in advance!


Solution

  • Idea why: you rbind to out in each iteration. Each rbind copies the whole of out, which grows with every pass, so you have to expect worse than linear growth in run time with increasing data set size. A quick fix is to collect the pieces in a list and rbind once at the end; see the first sketch below.

    So, as Andrie suggests, you can look at melt (second sketch below).

    Or you can do it with core R: stack. Then you need to cbind the fixed part to the result yourself, repeating the fixed columns so that they line up with the stacked rows (third sketch below).

    A third alternative would be array2df from package arrayhelpers.
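
To illustrate the first point, here is a minimal sketch of the same loop with the repeated rbind removed: every person/set slice goes into a list, and one do.call(rbind, ...) at the end assembles the result (argument names as in the question):

restructure2 <- function(data, firstcol, numcol, numsets){
    pieces <- vector("list", nrow(data) * numsets)
    k <- 0
    for (i in 1:nrow(data)) {
        # fixed part of person i
        fixdata <- data[i, 1:(firstcol - 1)]
        for (j in seq(firstcol, (firstcol - 1) + numcol * numsets, by = numcol)) {
            k <- k + 1
            flexdata <- data[i, j:(j + numcol - 1)]
            # use the first set's column names so all pieces match
            names(flexdata) <- names(data)[firstcol:(firstcol + numcol - 1)]
            pieces[[k]] <- cbind(fixdata, flexdata)
        }
    }
    # one rbind instead of nrow(data) * numsets of them
    do.call(rbind, pieces)
}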
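
For the melt route, a minimal sketch with reshape2 on a toy frame (two fixed columns, two sets of three items; the names id, age, A1 ... C2 are made up for illustration):

library(reshape2)

## toy data in the same wide layout
daten <- data.frame(id = 1:3, age = c(21, 34, 48),
                    A1 = rnorm(3), B1 = rnorm(3), C1 = rnorm(3),
                    A2 = rnorm(3), B2 = rnorm(3), C2 = rnorm(3))

long <- melt(daten, id.vars = c("id", "age"))
long$item <- sub("[0-9]+$", "", long$variable)    # "A1" -> "A"
long$set  <- sub("^[A-Za-z]+", "", long$variable) # "A1" -> "1"
## one row per person and set, one column per item
out <- dcast(long, id + age + set ~ item, value.var = "value")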
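
And a sketch of the stack idea on the same toy frame: stack concatenates the item columns one after another, so the fixed rows have to be repeated once per item column to line up:

vars <- daten[ , -(1:2)]            # the item columns only
long <- stack(vars)                 # columns: values, ind (old column name)
## stack goes column by column, so persons 1..n repeat for every column
fix  <- daten[rep(seq_len(nrow(daten)), times = ncol(vars)), 1:2]
long <- cbind(fix, long)

This gives the fully long form (one row per person and item); casting it back to one column per item works as in the melt sketch.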