Search code examples
rperformancememory-managementffffbase

Efficient Combination and Operating on Large Data Frames


I have 2 relatively large data frames in R. I'm attempting to merge / find all combos, as efficiently as possible. The resulting df turns out to be huge (the length is dim(myDF1)[1]*dim(myDF2)[1]), so I'm attempting to implement a solution using ff. I'm also open to using other solutions, such as the bigmemory package to work-around these memory issues. I'm have virtually no experience with either of these packages.

Working example - assume I'm working with some data frame that looks similar to USArrests:

library('ff')
library('ffbase')


myNames <- USArrests

myNames$States <- rownames(myNames)
rownames(myNames) <- NULL

Now, I will fabricate 2 data frames, which represent some particular sets of observations from myNames. I'm going to try to reference them by their rownames later.

myDF1 <- as.ffdf(as.data.frame(matrix(as.integer(rownames(myNames))[floor(runif(3*1e5, 1, 50))], ncol = 3)))
myDF2 <- as.ffdf(as.data.frame(matrix(as.integer(rownames(myNames))[floor(runif(2*1e5, 1, 50))], ncol = 2)))


# unique combos:
myDF1 <- unique(myDF1)
myDF2 <- unique(myDF2)

For example, my first set of states in myDF1 are myNames[unlist(myDF1[1, ]), ]. Then I will find all combos of myDF1 and myDF2 using ikey :

# create keys:
myDF1$key <- ikey(myDF1)
myDF2$key <- ikey(myDF2)

startTime <- Sys.time()


# Create some huge vectors:
myVector1 <- ffrep.int(myDF1$key, dim(myDF2)[1])
myVector2 <- ffrep.int(myDF2$key, dim(myDF1)[1])


# This takes about 25 seconds on my machine:
print(Sys.time() - startTime)


# Sort one DF (to later combine with the other):
myVector2  <- ffsorted(myVector2)

# Sorting takes an additional 2.5 minutes:
print(Sys.time() - startTime)

1) Is there a faster way to sort this?

# finally, find all combinations:
myDF <- as.ffdf(myVector1, myVector2)

# Very fast:
print(Sys.time() - startTime)

2) Is there an alternative to this type of combination (without using RAM)?

Finally, I'd like to be able to reference any of the original data by row / column. Specifically, I'd like to get different types of rowSums. For example:

# Here are the row numbers (from myNames) for the top 6 sets of States:
this <- cbind(myDF1[myDF[1:6,1], -4], myDF2[myDF[1:6,2], -3])
this

# Then, the original data for the first set of States is:
myNames[unlist(this[1,]),]

# Suppose I want to get the sum of the Urban Population for every row, such as the first:
sum(myNames[unlist(this[1,]),]$UrbanPop)

3) Ultimately, I'd like a vector with the above rowSum, so I can perform some type of subset on myDF. Any advice on how to most efficiently accomplish this?

Thanks!


Solution

  • It's pretty much unclear to me what you intent to do with the rowSum and your 3) element but if you want an efficient and RAM-friendly combination of 2 ff vectors, to get all combinations, you can use expand.ffgrid from ffbase. The following will generate your ffdf with dimensions 160Mio rows x 2 columns in a few seconds.

    require(ffbase)
    x <- expand.ffgrid(myDF1$key, myDF2$key)