R rearrange data


I have a bunch of texts written by the same person, and I'm trying to estimate the templates they use for each text. The way I'm going about this is:

  1. create a TermDocumentMatrix for all the texts
  2. take the raw Euclidean distance of each pair
  3. cut out any pair greater than X distance (10 for the sake of argument)
  4. flatten the forest
  5. return one example of each template with some summarized stats

I'm able to get to the point of having the distance pairs, but I am unable to convert the dist instance to something I can work with. There is a reproducible example at the bottom.

The data in the dist instance looks like this:

[Image: dist instance example]

The row and column names correspond to indexes in the original list of texts, which I can use to achieve step 5.

What I have been trying to get out of this is a sparse matrix with col name, row name, value.

col, row, value
  1    2  14.966630
  1    3  12.449900
  1    4  13.490738
  1    5  12.688578
  1    6  12.369317
  2    3  12.449900
  2    4  13.564660
  2    5  12.922848
  2    6  12.529964
  3    4   5.385165
  3    5   5.830952
  3    6   5.830952
  4    5   7.416198
  4    6   7.937254
  5    6   7.615773

From this point I would be comfortable cutting out all pairs greater than my cutoff and flattening the forest, i.e. returning 3 templates in this example: a group containing only document 1, a group containing only document 2, and a third group containing documents 3, 4, 5, and 6.

I have tried a bunch of things from creating a matrix out of this and then trying to make it sparse, to directly using the vector inside of the dist class, and I just can't seem to figure it out.

Reproducible example:

tdm <- matrix(c(1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,1,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,1,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,2,0,0,0,0,0,1,0,0,0,0,0,3,1,2,2,2,3,2,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,1,2,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,4,1,1,1,1,1,0,0,1,1,1,1,0,1,0,0,0,0,0,0,1,1,1,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,1,1,1,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,2,0,0,1,0,1,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,2,0,0,0,0,0,1,0,0,0,0,0,1,0,1,0,1,1,0,1,1,1,1,0,1,0,0,0,0,0,0,1,1,1,1,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,1,0,1,0,0,1,1,1,1,0,1,0,1,0,0,2,0,0,0,0,0,1,0,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,3,1,1,1,1,1,0,0,0,0,0,0,0,1,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,1,1,1,1,0,0,1,1,1,1,0,0,0,1,0,0,2,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,3,1,1,1,1,0,1,0,0,0,0,1,2,0,1,1,0,0,0,0,1,0,0,1,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,1,1,1,1,0,1,0,0,0,0,0,0,0,1,0,0,1,1,1,1,1,1,0,0,0,0,1,0,0,1,0,1,0,0,0,0,1,1,1,1,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,1,1,0,1,0,0,0,0,0,1,1,1,2,1,1,1,0,0,0,0,1,2,2,1,1,1,1,0,1,0,0,0,0,0,1,0,0,0,0,0,1,1,1,1,1,1,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,1,1,1,0,2,0,0,0,0,0,0,1,1,1,1,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,2,0,2,2,3,2,1,0,0,0,0,0,1,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,1,1,1,1,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,2,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,2,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,2,1,1,1,1,1,0,1,0,0,0,0,1,1,0,0,0,0,1,0,0,0,1,0,0,1,
1,1,1,1,1,0,0,0,0,0,1,0,0,0,0,1,0,1,1,1,1,1,0,0,1,1,1,0,0,1,0,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,1,2,1,1,1,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,1,0,2,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,3,0,1,1,1,1,0,0,1,0,1,1,1,0,0,0,0,0,1,0,0,0,0,0,4,2,4,6,4,3,1,0,1,2,1,1,0,1,0,0,0,0,2,0,0,0,0,0,0,1,1,1,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,2,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,1,1,1,1,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,2,1,2,2,2,2,1,0,1,2,1,1,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,2,2,2,2,2,2,3,3,4,5,3,1,2,1,1,1,1,1,1,0,0,0,0,3,3,0,0,1,1,0,1,0,0,0,0), nrow=6)
rownames(tdm) <- 1:6
colnames(tdm) <- paste("term", 1:229, sep="")
tdm.dist <- dist(tdm)
# I'm stuck turning tdm.dist into what I have shown

Solution

  • A classic approach to turn a "matrix"-like object into a [row, col, value] "data.frame" is the as.data.frame(as.table(.)) route. Specifically here, we need:

    subset(as.data.frame(as.table(as.matrix(tdm.dist))), as.numeric(Var1) < as.numeric(Var2))
    

    But that involves several coercions and builds the full n x n table only to discard more than half of it immediately.

    Since a "dist" object stores only the lower triangle of the distance matrix, column by column, we could use combn to generate the matching col/row indices and bind them to the values in the "dist" object:

    data.frame(do.call(rbind, combn(attr(tdm.dist, "Size"), 2, simplify = FALSE)), c(tdm.dist))
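
    To see why the combn ordering lines up with the stored values, here is a small sketch on toy data (four points on a line, chosen only so the distances are easy to read; the names x, d, idx and pairs are made up for this illustration). It names the columns and cross-checks every value against the full matrix:

    ```r
    # Toy data (an assumption for illustration): 4 points on a line.
    x <- matrix(c(0, 1, 3, 6), ncol = 1, dimnames = list(1:4, NULL))
    d <- dist(x)

    n   <- attr(d, "Size")
    idx <- t(combn(n, 2))   # one (col, row) pair per line, with col < row
    pairs <- data.frame(col = idx[, 1], row = idx[, 2], value = as.vector(d))

    # Cross-check: each value equals the matching cell of the full matrix.
    full <- as.matrix(d)
    stopifnot(all(pairs$value == full[cbind(pairs$row, pairs$col)]))

    print(pairs)
    ```

    The same three lines (n, idx, pairs) should apply unchanged to tdm.dist.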
    

    The "Matrix" package is another option; besides being memory-efficient when creating objects, it can return the triplet form directly:

    library(Matrix)
    tmp = combn(attr(tdm.dist, "Size"), 2)
    summary(sparseMatrix(i = tmp[2, ], j = tmp[1, ], x = c(tdm.dist), 
                         dims = rep_len(attr(tdm.dist, "Size"), 2), symmetric = TRUE))
    

    Additionally, among the various functions that handle "dist" objects directly,

    cutree(hclust(tdm.dist), h = 10)
    #1 2 3 4 5 6 
    #1 2 3 3 3 3
    

    groups by specifying the cut height.
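
To tie this back to steps 3 to 5 of the question, here is a minimal sketch rather than a definitive implementation. It assumes the 15 distances listed in the question's table instead of recomputing them from tdm, and the names pairs, cutoff, close_pairs, groups and templates are made up for illustration. One thing worth knowing: hclust uses complete linkage by default, whereas dropping every pair above the cutoff and flattening what remains corresponds to connected components, which is what single linkage cut at the same height returns. For this data both linkages happen to give the same three groups.

```r
# Distances copied from the question's table (col < row); assumed, not recomputed.
pairs <- data.frame(
  col   = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5),
  row   = c(2, 3, 4, 5, 6, 3, 4, 5, 6, 4, 5, 6, 5, 6, 6),
  value = c(14.966630, 12.449900, 13.490738, 12.688578, 12.369317,
            12.449900, 13.564660, 12.922848, 12.529964,
            5.385165, 5.830952, 5.830952,
            7.416198, 7.937254, 7.615773)
)
cutoff <- 10  # hypothetical threshold ("X" in the question)

# Step 3: keep only the pairs at or below the cutoff.
close_pairs <- pairs[pairs$value <= cutoff, ]

# Rebuild a "dist" object for hclust; as.dist reads the lower triangle.
n <- max(pairs$row)
m <- matrix(0, n, n, dimnames = list(1:n, 1:n))
m[cbind(pairs$row, pairs$col)] <- pairs$value
d <- as.dist(m)

# Step 4: single linkage cut at the cutoff gives the connected components.
groups <- cutree(hclust(d, method = "single"), h = cutoff)

# Step 5: one representative (the first member) and the size of each group.
members   <- split(as.integer(names(groups)), groups)
templates <- data.frame(group   = as.integer(names(members)),
                        example = sapply(members, `[`, 1),
                        size    = lengths(members))
print(templates)
```

With the real data, d can simply be replaced by tdm.dist, and the example column indexes back into the original list of texts.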