Data format: CSV.
Total number of documents: 500. Number of fields: 10.
I want to calculate, in parallel, the cosine similarity of each "Docs" entry with all 500 documents.
Expected output:
Does this do what you want? To compute the similarity of all (500*499)/2 combinations, you can do something like this:
# Create some mock data: a 500 x 10 numeric matrix (documents in rows, fields in columns)
df <- replicate(10, rnorm(500))
rownames(df) <- paste0("doc", seq_len(nrow(df)))
colnames(df) <- paste0("field", seq_len(ncol(df)))
# Euclidean length (L2 norm) of each document vector
vl <- sqrt(rowSums(df * df))
# Matrix of all pairs of row indices (one pair per row)
comb <- t(combn(seq_len(nrow(df)), 2))
# Compute cosine similarity for each pair: dot product divided by the product of the lengths
csim <- apply(comb, 1, FUN = function(i) sum(df[i[1], ] * df[i[2], ]) / prod(vl[i]))
# Create a data.frame of the results
res <- data.frame(docA = rownames(df)[comb[, 1]],
                  docB = rownames(df)[comb[, 2]],
                  csim = csim)
head(res)
# docA docB csim
#1 doc1 doc2 -0.6431972
#2 doc1 doc3 -0.2560444
#3 doc1 doc4 -0.4911942
#4 doc1 doc5 -0.2207487
#5 doc1 doc6 0.4764924
#6 doc1 doc7 0.5867607
tail(res)
# docA docB csim
#124745 doc497 doc498 1.0714338
#124746 doc497 doc499 0.8439304
#124747 doc497 doc500 1.1806366
#124748 doc498 doc499 0.9326781
#124749 doc498 doc500 1.4783254
#124750 doc499 doc500 1.3626494
Note that it does not really make sense to include the original field values in this output table: each number is a comparison computed from two rows of your data.
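To make the per-pair computation concrete, here is a minimal sketch on a tiny made-up matrix whose similarities can be checked by hand. The matrix `m` and the helper `cos_sim` are hypothetical names used only for this illustration; they are not part of the code above.

```r
# Tiny made-up matrix (hypothetical data, chosen so the results are easy to verify)
m <- rbind(doc1 = c(1, 0, 0),
           doc2 = c(1, 1, 0),
           doc3 = c(-2, 0, 0))

# Cosine similarity of two numeric vectors:
# dot product divided by the product of the Euclidean lengths
cos_sim <- function(a, b) sum(a * b) / (sqrt(sum(a * a)) * sqrt(sum(b * b)))

# All pairs of row indices: (1,2), (1,3), (2,3)
comb <- t(combn(seq_len(nrow(m)), 2))

# One similarity per pair
csim <- apply(comb, 1, function(i) cos_sim(m[i[1], ], m[i[2], ]))
print(csim)
```

By hand, doc1 and doc2 give 1/sqrt(2), doc1 and doc3 point in opposite directions and give -1, and doc2 and doc3 give -1/sqrt(2). Cosine similarity always lies in [-1, 1], so a value outside that range would point to a problem in the computation.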
Edit:
If you want it in matrix form, you can compute it directly with:
res_mat <- tcrossprod(df)/tcrossprod(vl)
print(res_mat[1:5, 1:5])
# doc1 doc2 doc3 doc4 doc5
#doc1 1.0000000 -0.6431972 -0.2560444 -0.4911942 -0.2207487
#doc2 -0.6431972 1.0000000 0.3996618 0.3365490 -0.1434239
#doc3 -0.2560444 0.3996618 1.0000000 0.2856842 0.2781019
#doc4 -0.4911942 0.3365490 0.2856842 1.0000000 0.2287057
#doc5 -0.2207487 -0.1434239 0.2781019 0.2287057 1.0000000
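If you later need the long table again, one possible way to get it from the matrix is to pull out the upper triangle. This is only a sketch on a small mock matrix: the names `m`, `idx` and `res_long` and the seed are assumptions made for the illustration, and the pairs come out in column-major order rather than in the `combn` order used earlier.

```r
set.seed(42)  # arbitrary seed, only so the sketch is repeatable

# Small mock matrix: 5 documents, 3 fields (hypothetical data)
m <- matrix(rnorm(5 * 3), nrow = 5,
            dimnames = list(paste0("doc", 1:5), paste0("field", 1:3)))

# Cosine similarity matrix, same formula as above
vl <- sqrt(rowSums(m * m))
res_mat <- tcrossprod(m) / tcrossprod(vl)

# Row/column indices of the upper triangle (each unordered pair once, no diagonal)
idx <- which(upper.tri(res_mat), arr.ind = TRUE)

# Long-format table with one row per document pair
res_long <- data.frame(docA = rownames(res_mat)[idx[, "row"]],
                       docB = colnames(res_mat)[idx[, "col"]],
                       csim = res_mat[idx])
print(head(res_long))
```

The matrix is symmetric with ones on the diagonal, so the upper triangle holds every distinct pair exactly once.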