
Create a new data frame based on another dataframe


I am trying to use a huge dataframe (180000 x 400) to calculate another one that would be much smaller.

I have the following dataframe

df1=data.frame(LOCAT=c(1,2,3,4,5,6),START=c(120,345,765,1045,1347,1879),END=c(150,390,802,1120,1436,1935),CODE1=c(1,1,0,1,0,0),CODE2=c(1,0,0,0,-1,-1))

df1
  LOCAT START  END CODE1 CODE2
1     1   120  150     1     1
2     2   345  390     1     0
3     3   765  802     0     0
4     4  1045 1120     1     0
5     5  1347 1436     0    -1
6     6  1879 1935     0    -1

This is a sample dataframe. The real one has 180000 rows and over 400 columns. What I need to do is create a new dataframe for each CODE column that gives the size of each consecutive run of "1" or "-1", returned together with its location, size and value.

Something like this for CODE1:

   LOCAT SIZE VALUE
1 1 to 2  270   POS
2 4 to 4   75   POS

And like this for CODE2:

   LOCAT SIZE VALUE
1 1 to 1   30   POS
2 5 to 6  588   NEG

Unfortunately, I still haven't figured out how to do this. I have tried several approaches to writing a function that does this automatically, but I keep getting lost or stuck in loops, and nothing seems to work.

Any help would be appreciated. Thanks in advance


Solution

  • The code below gives the answer in the format you wanted, except that I split your "LOCAT" column into two columns named "Starts" and "Stops". It works on the entire data frame, so there is no need to repeat it manually for each CODE column (CODE1, CODE2, etc.).

    It assumes that the only non-CODE columns are named "LOCAT", "START" and "END".

    # need package "plyr"
    library("plyr")
    
    # test2 is the example data frame that you gave in the question
    test2 <- data.frame(
        "LOCAT"=1:6, 
        "START"=c(120,345,765, 1045, 1347, 1879), 
        "END"=c(150,390,802,1120,1436, 1935), 
        "CODE1"=c(1,1,0,1,0,0),
        "CODE2"=c(1,0,0,0,-1,-1)
        )
    
    codeNames <- names(test2)[!names(test2)%in%c("LOCAT","START","END")] # the names of the columns that correspond to the different codes
    test3 <- reshape(test2, varying=codeNames, direction="long", v.names="CodeValue", timevar="Code") # reshape so that all codes are stacked in one column ("CodeValue"); "Code" holds each code column's position in codeNames (1 for CODE1, 2 for CODE2, ...)
    test4 <- test3[,!names(test3)%in%"id"] # remove the "id" column added by reshape()
    
    sss <- function(x){ # sss gives the starting points, stopping points, and sizes (sss) in a data frame
        rleX <- rle(x[,"CodeValue"]) # rle() to get the size of consecutive values
        stops <- cumsum(rleX$lengths) # cumulative sum to get the end-points for the indices (the second value in your LOCAT column)
        starts <- c(1, head(stops,-1)+1) # the starts are the first value in your LOCAT column
        ssX0 <- data.frame("Value"=rleX$values, "Starts"=starts, "Stops"=stops) #the starts and stops from X (ss from X)
        ssX <- ssX0[ssX0[,"Value"]!=0,] # remove the rows that correspond to CODE_ values that are 0 (not POS or NEG)
    
        # The next 3 lines calculate the equivalent of your SIZE column
        sizeX1 <- x[ssX[,"Starts"],"START"]
        sizeX2 <- x[ssX[,"Stops"],"END"]
        sizeX <- sizeX2 - sizeX1
    
        sssX <- data.frame(ssX, "Size"=sizeX) # Combine the Size to the ssX (start stop of X) data frame
        return(sssX) # return the result so that ddply() can collect it
    
    }
    
    answer0 <- ddply(.data=test4, .variables="Code", .fun=sss) # use ddply() from the package "plyr" to apply sss to each code (this is why we reshaped)
    answer <- answer0 # duplicate the original, new version will be reformatted
    answer[,"Value"] <- c("NEG",NA,"POS")[answer0[,"Value"]+2] # reformat slightly so that we have POS/NEG instead of 1/-1
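
To see what `rle()` and `cumsum()` contribute inside `sss`, here is a minimal sketch on the CODE1 column alone, using the sample values from the question. The names `code1`, `runs`, `keep` and `code1_runs` are illustrative only and do not appear in the code above.

```r
# CODE1, START and END from the question's sample data
code1 <- c(1, 1, 0, 1, 0, 0)
START <- c(120, 345, 765, 1045, 1347, 1879)
END   <- c(150, 390, 802, 1120, 1436, 1935)

runs   <- rle(code1)                 # lengths 2 1 1 2, values 1 0 1 0
stops  <- cumsum(runs$lengths)       # last row index of each run: 2 3 4 6
starts <- c(1, head(stops, -1) + 1)  # first row index of each run: 1 3 4 5

keep <- runs$values != 0             # drop the runs of zeros

code1_runs <- data.frame(
  Starts = starts[keep],
  Stops  = stops[keep],
  Size   = END[stops[keep]] - START[starts[keep]]  # END of last row minus START of first row
)
code1_runs
```

This yields the two CODE1 runs from the question: rows 1 to 2 with size 270 and rows 4 to 4 with size 75.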
    

    Hopefully this helps, good luck!
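
If installing "plyr" is not an option, or if the long-format copy produced by reshape() turns out to be too large for a 180000 x 400 data frame, the same idea can be sketched in base R by looping over the code columns directly. This is an alternative under stated assumptions, not the code from the answer above: `run_summary` is a hypothetical helper name, and the sketch assumes the code columns contain only 1, 0 and -1 with no missing values.

```r
# Hypothetical helper: one row per run of 1 or -1 in a single code column.
# Assumes `code` contains only 1, 0 and -1 (no NA values).
run_summary <- function(code, start, end) {
  runs   <- rle(code)
  stops  <- cumsum(runs$lengths)        # last row index of each run
  starts <- c(1, head(stops, -1) + 1)   # first row index of each run
  keep   <- runs$values != 0            # ignore runs of zeros
  data.frame(
    Starts = starts[keep],
    Stops  = stops[keep],
    Size   = end[stops[keep]] - start[starts[keep]],
    Value  = ifelse(runs$values[keep] > 0, "POS", "NEG"),
    stringsAsFactors = FALSE
  )
}

# Sample data from the question
df1 <- data.frame(
  LOCAT = 1:6,
  START = c(120, 345, 765, 1045, 1347, 1879),
  END   = c(150, 390, 802, 1120, 1436, 1935),
  CODE1 = c(1, 1, 0, 1, 0, 0),
  CODE2 = c(1, 0, 0, 0, -1, -1)
)

# Every column except LOCAT, START and END is treated as a code column
code_cols <- setdiff(names(df1), c("LOCAT", "START", "END"))

# Summarise each code column, then stack the pieces into one data frame
per_code <- lapply(code_cols, function(nm) {
  out <- run_summary(df1[[nm]], df1$START, df1$END)
  data.frame(Code = rep(nm, nrow(out)), out, stringsAsFactors = FALSE)
})
result <- do.call(rbind, per_code)
result
```

Here the `Code` column keeps the original column name (CODE1, CODE2, ...) rather than its position, and each column is processed on its own, which may be lighter on memory than reshaping everything to long format first.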