r performance optimization matrix sparse-matrix

Optimizing Speed for Populating a Matrix

I am trying to fill in a large matrix (55920484 elements) in R that will eventually be symmetric (so I am actually only performing calculations for half of the matrix). The resulting values matrix is a square matrix which has the same row and column names. Each value in the matrix is the result of comparing unique lists and counting the number of intersections. This data comes from a larger dataframe (427.5 Mb). Here is my fastest solution so far, I am trying to get rid of the loops which I know are slow:

for(i in 1:length(rownames(values))){
  for(j in i:length(colnames(values))){
    A = data[data$Stock==rownames(values)[i],"Fund"]
    B = data[data$Stock==colnames(values)[j],"Fund"]
    values[i, j] = length(intersect(A, B))
  }
}

I have tried several other approaches such as using a database with an SQL connection, using a sparse matrix with 0s and 1s, and using the sqldf package in R.

Here is the structure of my data:

head(data)

  Fund                          Stock Type Shares.Held Maket.Value X..of.Portfolio Rank Change.in.Shares X..Change X..Ownership
1 12 WEST CAPITAL MANAGEMENT LP  GRUB CALL      500000    12100000          0.0173   12           500000       New          N/A
2 12 WEST CAPITAL MANAGEMENT LP  FIVE   SH      214521     6886000          0.0099   15           214521       New            0
3 12 WEST CAPITAL MANAGEMENT LP  SHAK   SH      314114    12439000          0.0178   11           307114      4387            1
4 12 WEST CAPITAL MANAGEMENT LP  FRSH   SH      324120     3650000          0.0053   16          -175880       -35            2
5 12 WEST CAPITAL MANAGEMENT LP  ATRA   SH      393700    10398000          0.0149   14           162003        69            1
6 12 WEST CAPITAL MANAGEMENT LP  ALNY   SH      651000    61285000          0.0875    4        No Change         0            1

Solution

I see three problems, in order of increasing importance:

(1) You call rownames(values) and colnames(values) many times, instead of calling them just once outside of the loops. It may or may not help.

(2) You calculate A = data[data$Stock==rownames(values)[i],"Fund"] under the innermost loop, while you should calculate it outside of this loop.

(3) Most important: Your code uses only two columns of your table: Fund and Stock. I see that in your data there are many rows with same both Fund and Stock. You should eliminate this redundacy. Maybe you want to create data1=data[,c("Fund","Stock")] and eliminate redundant rows in data1 (without loop):

data1 = data1[,order(data1[,"Fund"])]
len = nrow(data1)
good = c(TRUE,data1[-len,1]!=data1[-1,1]|data1[-len,2]!=data1[-1,2])
data1 = data1[good,]

(I did not test the code above)

Maybe you want to go further and create the list, which, for each fund, specifies what stocks it contains, without redundancies.

PS: you can still create the list which, for each stock, specifies what funds have it:

rv = rownames(values)
len = length(rv)
fund.list = list()
for (i in 1:len)
    fund.list[[,i]] = data[data$Stock==rv[i],"Fund"]
for (i in 1:len) {
    A = fund.list[[i]]
    for (j in i:len) {
        values[i, j] = length(intersect(A, fund.list[[j]]))
    }
}