Search code examples
rdata.tableapplysimilaritycosine-similarity

Apply (cosine) similarity measure over a datatable


I am looking for a sensible way of determining the similarity between project team members who have all been scored on four dimensions.

A data excerpt has been added here below and a slightly larger example is added at the end of the question in dput

pnum invid dom_st prim_st pat_st net_st
 1: 7265873 24104      0       1      1      0
 2: 7266757 38775      1       2      2      3
 3: 7266757 38776      1       2      2      3
 4: 7268524 34281      1       3      2      2
 5: 7268524 34282      1       3      2      2
 6: 7272620 20002      0       1      2      0
 7: 7272620 22284      0       1      2      0
 8: 7273253 31921      1       1      1      4
 9: 7273253 31922      1       1      1      4
10: 7283628 26841      1       1      1      2
11: 7283628 26843      1       1      1      2
12: 7289442 17763      2      11     48     10
13: 7289442 17764      2      11     63      9
14: 7289525 38087      0       1      1      0
15: 7289525 38088      0       2      1      0
16: 7289525 38089      0       3      1      1

The goal is to create a similarity measure for each 'pnum' that compares the four last column values across all 'invid'. The number of 'invid' per 'pnum' varies between 2 and 26.

EDIT 1: Concretely, for 'pnum' 7266757 (row 2 and 3) I want to similarity between th vector for invid 38775 (1,2,2,3) and invid 38776 (1,2,2,3) so this one should give a result of 1. For 'pnum' 7289525 (rows 14-16), I want the similarity between the three row-vectors (0,1,1,0), (0,2,1,0), and (0,3,1,1). This gives the below:

simil(matrix(c(0,1,1,0,0,2,1,0,0,3,1,1), nrow = 3, byrow = TRUE), method = "cosine")
          1         2
2 0.9486833          
3 0.8528029 0.9438798

In a final step (could be a separate formula), I would like "to reduce" that matrix (for teams of n > 2) to a single value that ideally would be constrained between 0 and 1. A simple way of doing so would be to just take the mean of the matrix result but perhaps there is a smarter way?

I tried the following (with data stored in data.table 'dt' but that gave the below error:

library('proxy')    
sim <- dt[, simil(dt, method="cosine"), by = pnum]
    Error in .Call("R_cosine", c(4262069, 4262069, 4262069, 4273567, 4273567, : negative length vectors are not allowed

Any suggestion to more successfully apply this or a similar function to a data.table and creative ideas for how to reduce a similarity matrix to a single point value would be very welcome.

The total dataset is about 150,000 rows with about 92,000 projects 'pnum'.

structure(list(pnum = c(7265873, 7266757, 7266757, 7268524, 7268524, 
7272620, 7272620, 7273253, 7273253, 7283628, 7283628, 7289442, 
7289442, 7289525, 7289525, 7289525, 7301987, 7301987, 7305259, 
7305259, 7307986, 7307986, 7310332, 7310332, 7333490, 7333490, 
7333502, 7333502, 7414991, 7414991), invid = c(24104, 38775, 
38776, 34281, 34282, 20002, 22284, 31921, 31922, 26841, 26843, 
17763, 17764, 38087, 38088, 38089, 34843, 38412, 32514, 33946, 
28587, 28588, 17204, 17205, 28587, 28588, 28587, 28588, 37008, 
37009), dom_st = c(0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 0, 
0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0), prim_st = c(1, 
2, 2, 3, 3, 1, 1, 1, 1, 1, 1, 11, 11, 1, 2, 3, 3, 3, 1, 1, 5, 
5, 3, 3, 5, 5, 5, 5, 3, 3), pat_st = c(1, 2, 2, 2, 2, 2, 2, 1, 
1, 1, 1, 48, 63, 1, 1, 1, 1, 1, 1, 1, 5, 5, 14, 14, 5, 5, 5, 
5, 1, 1), net_st = c(0, 3, 3, 2, 2, 0, 0, 4, 4, 2, 2, 10, 9, 
0, 0, 1, 2, 2, 0, 0, 2, 2, 4, 4, 2, 2, 2, 2, 0, 0)), .Names = c("pnum", 
"invid", "dom_st", "prim_st", "pat_st", "net_st"), class = c("data.table", 
"data.frame"), row.names = c(NA, -30L), .internal.selfref = <pointer: 0x0000000000230788>)

Solution

  • This works for me:

    library(data.table)
    setDT(DT)
    
    # find relevant columns for call to simil
    cols <- stringr::str_subset(names(DT), "_st$")
    cols
    #[1] "dom_st"  "prim_st" "pat_st"  "net_st"
    
    DT[, (mean(proxy::simil(.SD, method="cosine"))), .SDcols = cols, by = pnum]
    #       pnum        V1
    # 1: 7265873       NaN
    # 2: 7266757 1.0000000
    # 3: 7268524 1.0000000
    # 4: 7272620 1.0000000
    # 5: 7273253 1.0000000
    # 6: 7283628 1.0000000
    # 7: 7289442 0.9968006
    # 8: 7289525 0.9151220
    # 9: 7301987 1.0000000
    #10: 7305259 1.0000000
    #11: 7307986 1.0000000
    #12: 7310332 1.0000000
    #13: 7333490 1.0000000
    #14: 7333502 1.0000000
    #15: 7414991 1.0000000
    

    Note: I need to wrap the j expression in parantheses. Without, I do get an error messages which I don't understand:

    DT[, mean(proxy::simil(.SD, method="cosine")), .SDcols = cols, by = pnum]
    

    Error in FUN(X[[i]], ...) :
    Invalid column: it has dimensions. Can't format it. If it's the result of data.table(table()), use as.data.table(table()) instead.

    Edit 1

    If you want to get the similarity matrices for each pnum (before averaging them) I suggest to use lapply() which returns a list:

    pnums <- DT[, unique(pnum)]
    results <- lapply(pnums, function(x) {
      proxy::simil(DT[pnum == x, cols, with = FALSE], method="cosine")
    })
    setNames(results, pnums)
    #$`7265873`
    #simil(0)
    #
    #$`7266757`
    #  1
    #2 1
    #
    #$`7268524`
    #  1
    #2 1
    #
    #$`7272620`
    #  1
    #2 1
    #
    #$`7273253`
    #  1
    #2 1
    #
    #$`7283628`
    #  1
    #2 1
    #
    #$`7289442`
    #          1
    #2 0.9968006
    #
    #$`7289525`
    #          1         2
    #2 0.9486833          
    #3 0.8528029 0.9438798
    #
    #$`7301987`
    #  1
    #2 1
    #
    #$`7305259`
    #  1
    #2 1
    #
    #$`7307986`
    #  1
    #2 1
    #
    #$`7310332`
    #  1
    #2 1
    #
    #$`7333490`
    #  1
    #2 1
    #
    #$`7333502`
    #  1
    #2 1
    #
    #$`7414991`
    #  1
    #2 1
    

    Edit 2

    The OP has added an additional requirement that he wants to compute a number of aggregate values for each pnum. This can be achieved by

    DT[, {
      sim_mat <- proxy::simil(.SD, method="cosine")
      list(min = min(sim_mat), max = max(sim_mat), 
           mean = mean(sim_mat), sd = sd(sim_mat))
    }, .SDcols = cols, by = pnum]
    #       pnum       min       max      mean         sd
    # 1: 7265873       Inf      -Inf       NaN         NA
    # 2: 7266757 1.0000000 1.0000000 1.0000000         NA
    # 3: 7268524 1.0000000 1.0000000 1.0000000         NA
    # 4: 7272620 1.0000000 1.0000000 1.0000000         NA
    # 5: 7273253 1.0000000 1.0000000 1.0000000         NA
    # 6: 7283628 1.0000000 1.0000000 1.0000000         NA
    # 7: 7289442 0.9968006 0.9968006 0.9968006         NA
    # 8: 7289525 0.8528029 0.9486833 0.9151220 0.05402336
    # 9: 7301987 1.0000000 1.0000000 1.0000000         NA
    #10: 7305259 1.0000000 1.0000000 1.0000000         NA
    #11: 7307986 1.0000000 1.0000000 1.0000000         NA
    #12: 7310332 1.0000000 1.0000000 1.0000000         NA
    #13: 7333490 1.0000000 1.0000000 1.0000000         NA
    #14: 7333502 1.0000000 1.0000000 1.0000000         NA
    #15: 7414991 1.0000000 1.0000000 1.0000000         NA
    

    Data

    DT <- structure(list(pnum = c(7265873, 7266757, 7266757, 7268524, 7268524, 
    7272620, 7272620, 7273253, 7273253, 7283628, 7283628, 7289442, 
    7289442, 7289525, 7289525, 7289525, 7301987, 7301987, 7305259, 
    7305259, 7307986, 7307986, 7310332, 7310332, 7333490, 7333490, 
    7333502, 7333502, 7414991, 7414991), invid = c(24104, 38775, 
    38776, 34281, 34282, 20002, 22284, 31921, 31922, 26841, 26843, 
    17763, 17764, 38087, 38088, 38089, 34843, 38412, 32514, 33946, 
    28587, 28588, 17204, 17205, 28587, 28588, 28587, 28588, 37008, 
    37009), dom_st = c(0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 0, 
    0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0), prim_st = c(1, 
    2, 2, 3, 3, 1, 1, 1, 1, 1, 1, 11, 11, 1, 2, 3, 3, 3, 1, 1, 5, 
    5, 3, 3, 5, 5, 5, 5, 3, 3), pat_st = c(1, 2, 2, 2, 2, 2, 2, 1, 
    1, 1, 1, 48, 63, 1, 1, 1, 1, 1, 1, 1, 5, 5, 14, 14, 5, 5, 5, 
    5, 1, 1), net_st = c(0, 3, 3, 2, 2, 0, 0, 4, 4, 2, 2, 10, 9, 
    0, 0, 1, 2, 2, 0, 0, 2, 2, 4, 4, 2, 2, 2, 2, 0, 0)), .Names = c("pnum", 
    "invid", "dom_st", "prim_st", "pat_st", "net_st"), class = c("data.table", 
    "data.frame"), row.names = c(NA, -30L))