Tags: r, dataframe, dot-product, r-daisy

Computing the dot product between all column pairs in a data frame


I have an R data frame whose columns are logical variables. I need to compute a dot product between every possible pair of columns.

This arises from text corpus analysis, where the data frame indicates which terms (rows) are present in which documents (columns). There are common, fast solutions for computing distances between every possible pair of columns, such as daisy from the cluster package or cosine from the lsa package.

However, I need a dot product between all pairs of columns instead: the goal is to count, for each pair of documents, how many words are present in both documents being compared.


Solution

  • Let's use this example:

    df <- data.frame(x1 = c(TRUE, TRUE, FALSE), x2 = c(FALSE, FALSE, FALSE), x3 = c(TRUE, FALSE, TRUE))

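    I would turn the data frame into a matrix and then compute the cross product. data.matrix converts the logical columns to 0/1, and crossprod(m) is t(m) %*% m, so entry (i, j) counts the rows where columns i and j are both TRUE: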

    crossprod(data.matrix(df))
    #    x1 x2 x3
    # x1  2  0  1
    # x2  0  0  0
    # x3  1  0  2
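
As a sanity check, the cross product can be compared against the definition in the question: the number of rows in which both columns are TRUE. The sketch below is only illustrative. It reuses the toy data frame from the answer, and the names `m`, `shared`, `naive`, `idx`, and `pairs`, as well as the long-format layout of `pairs`, are assumptions made for this example rather than anything from the original answer.

```r
# Same toy data as above: rows are terms, columns are documents.
df <- data.frame(x1 = c(TRUE, TRUE, FALSE),
                 x2 = c(FALSE, FALSE, FALSE),
                 x3 = c(TRUE, FALSE, TRUE))

m <- data.matrix(df)     # logical columns become 0/1 integers
shared <- crossprod(m)   # equivalent to t(m) %*% m

# Naive check: for every pair of columns, count the rows where both are TRUE.
naive <- sapply(df, function(a) sapply(df, function(b) sum(a & b)))
stopifnot(all(shared == naive))

# Optional: one row per unordered pair of distinct documents
# (hypothetical long-format layout, not part of the original answer).
idx <- which(upper.tri(shared), arr.ind = TRUE)
pairs <- data.frame(doc1 = colnames(shared)[idx[, "row"]],
                    doc2 = colnames(shared)[idx[, "col"]],
                    shared_terms = shared[idx])
pairs
```

The diagonal of `shared` holds the number of terms present in each document, and the off-diagonal entries hold the counts of terms common to each pair. The nested `sapply` is there only to confirm the result on small inputs; it loops over every pair of columns, so `crossprod` is the version to use on a real corpus.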