I am trying to fill in a large matrix (55920484 elements) in R that will eventually be symmetric (so I am actually only performing calculations for half of the matrix). The resulting values matrix is a square matrix which has the same row and column names. Each value in the matrix is the result of comparing unique lists and counting the number of intersections. This data comes from a larger dataframe (427.5 Mb). Here is my fastest solution so far, I am trying to get rid of the loops which I know are slow:
for(i in 1:length(rownames(values))){
for(j in i:length(colnames(values))){
A = data[data$Stock==rownames(values)[i],"Fund"]
B = data[data$Stock==colnames(values)[j],"Fund"]
values[i, j] = length(intersect(A, B))
}
}
I have tried several other approaches such as using a database with an SQL connection, using a sparse matrix with 0s and 1s, and using the sqldf package in R.
Here is the structure of my data:
head(data)
Fund Stock Type Shares.Held Maket.Value X..of.Portfolio Rank Change.in.Shares X..Change X..Ownership
1 12 WEST CAPITAL MANAGEMENT LP GRUB CALL 500000 12100000 0.0173 12 500000 New N/A
2 12 WEST CAPITAL MANAGEMENT LP FIVE SH 214521 6886000 0.0099 15 214521 New 0
3 12 WEST CAPITAL MANAGEMENT LP SHAK SH 314114 12439000 0.0178 11 307114 4387 1
4 12 WEST CAPITAL MANAGEMENT LP FRSH SH 324120 3650000 0.0053 16 -175880 -35 2
5 12 WEST CAPITAL MANAGEMENT LP ATRA SH 393700 10398000 0.0149 14 162003 69 1
6 12 WEST CAPITAL MANAGEMENT LP ALNY SH 651000 61285000 0.0875 4 No Change 0 1
I see three problems, in order of increasing importance:
(1) You call rownames(values) and colnames(values) many times, instead of calling them just once outside of the loops. It may or may not help.
(2) You calculate A = data[data$Stock==rownames(values)[i],"Fund"]
under the innermost loop, while you should calculate it outside of this loop.
(3) Most important: Your code uses only two columns of your table: Fund and Stock. I see that in your data there are many rows with same both Fund and Stock. You should eliminate this redundacy. Maybe you want to create data1=data[,c("Fund","Stock")]
and eliminate redundant rows in data1 (without loop):
data1 = data1[,order(data1[,"Fund"])]
len = nrow(data1)
good = c(TRUE,data1[-len,1]!=data1[-1,1]|data1[-len,2]!=data1[-1,2])
data1 = data1[good,]
(I did not test the code above)
Maybe you want to go further and create the list, which, for each fund, specifies what stocks it contains, without redundancies.
PS: you can still create the list which, for each stock, specifies what funds have it:
rv = rownames(values)
len = length(rv)
fund.list = list()
for (i in 1:len)
fund.list[[,i]] = data[data$Stock==rv[i],"Fund"]
for (i in 1:len) {
A = fund.list[[i]]
for (j in i:len) {
values[i, j] = length(intersect(A, fund.list[[j]]))
}
}