Search code examples
rdata-miningsimilaritycosine-similarity

cosine similarity(patient similarity metric) between 48k patients data with predictive variables


I have to calculate cosine similarity (patient similarity metric) in R between 48k patients data with some predictive variables. Here is the equation: PSM(P1,P2) = P1.P2/ ||P1|| ||P2||
PSM(P1,P2) = P1.P2/ ||P1|| ||P2||
where P1 and P2 are the predictor vectors corresponding to two different patients, where for example P1 index patient and P2 will be compared with index (P1) and finally pairwise patient similarity metric PSM(P1,P2) will be calculated.

This process will go on for all 48k patients.

I have added sample data-set for 300 patients in a .csv file. Please find the sample data-set here.https://1drv.ms/u/s!AhoddsPPvdj3hVTSbosv2KcPIx5a


Solution

  • First things first: You can find more rigorous treatments of cosine similarity at either of these posts:

    Now, you clearly have a mixture of data types in your input, at least

    • decimal
    • integer
    • categorical

    I suspect that some of the integer values are Booleans or additional categoricals. Generally, it will be up to you to transform these into continuous numerical vectors if you want to use them as input into the similarity calculation. For example, what's the distance between admission types ELECTIVE and EMERGENCY? Is it a nominal or ordinal variable? I will only be modelling the columns that I trust to be numerical dependent variables.

    Also, what have you done to ensure that some of your columns don't correlate with others? Using just a little awareness of data science and biomedical terminology, it seems likely that the following are all correlated:

    diasbp_max, diasbp_min, meanbp_max, meanbp_min, sysbp_max and sysbp_min

    I suggest going to a print shop and ordering a poster-size printout of psm_pairs.pdf. :-) Your eyes are better at detecting meaningful (but non-linear) dependencies between variable. Including multiple measurements of the same fundamental phenomenon may over-weight that phenomenon in your similarity calculation. Don't forget that you can derive variables like

    diasbp_rage <- diasbp_max - diasbp_min
    

    Now, I'm not especially good at linear algebra, so I'm importing a cosine similarity function form the lsa text analysis package. I'd love to see you write out the formula in your question as an R function. I would write it to compare one row to another, and use two nested apply loops to get all comparisons. Hopefully we'll get the same results!

    After calculating the similarity, I try to find two different patients with the most dissimilar encounters.

    Since you're working with a number of rows that's relatively large, you'll want to compare various algorithmic methodologies for efficiency. In addition, you could use SparkR/some other Hadoop solution on a cluster, or the parallel package on a single computer with multiple cores and lots of RAM. I have no idea whether the solution I provided is thread-safe.

    Come to think of it, the transposition alone (as I implemented it) is likely to be computationally costly for a set of 1 million patient-encounters. Overall, (If I remember my computational complexity correctly) as the number of rows in your input increases, the performance could degrade exponentially.

    library(lsa)
    library(reshape2)
    
    psm_sample <- read.csv("psm_sample.csv")
    
    row.names(psm_sample) <-
      make.names(paste0("patid.", as.character(psm_sample$subject_id)), unique = TRUE)
    temp <- sapply(psm_sample, class)
    temp <- cbind.data.frame(names(temp), as.character(temp))
    names(temp) <- c("variable", "possible.type")
    
    numeric.cols <- (temp$possible.type %in% c("factor", "integer") &
                       (!(grepl(
                         pattern = "_id$", x = temp$variable
                       ))) &
                       (!(
                         grepl(pattern = "_code$", x = temp$variable)
                       )) &
                       (!(
                         grepl(pattern = "_type$", x = temp$variable)
                       ))) | temp$possible.type == "numeric"
    
    psm_numerics <- psm_sample[, numeric.cols]
    row.names(psm_numerics) <- row.names(psm_sample)
    
    psm_numerics$gender <- as.integer(psm_numerics$gender)
    
    psm_scaled <- scale(psm_numerics)
    
    pair.these.up <- psm_scaled
    # checking for independence of variables
    # if the following PDF pair plot is too big for your computer to open,
    # try pair-plotting some random subset of columns
    # keep.frac <- 0.5
    # keep.flag <- runif(ncol(psm_scaled)) < keep.frac
    # pair.these.up <- psm_scaled[, keep.flag]
    # pdf device sizes are in inches
    dev <-
      pdf(
        file = "psm_pairs.pdf",
        width = 50,
        height = 50,
        paper = "special"
      )
    pairs(pair.these.up)
    dev.off()
    
    #transpose the dataframe to get the
    #similarity between patients
    cs <- lsa::cosine(t(psm_scaled))
    
    # this is super inefficnet, because cs contains
    # two identical triangular matrices
    cs.melt <- melt(cs)
    cs.melt <- as.data.frame(cs.melt)
    names(cs.melt) <- c("enc.A", "enc.B", "similarity")
    
    extract.pat <- function(enc.col) {
      my.patients <-
        sapply(enc.col, function(one.pat) {
          temp <- (strsplit(as.character(one.pat), ".", fixed = TRUE))
          return(temp[[1]][[2]])
        })
      return(my.patients)
    }
    cs.melt$pat.A <- extract.pat(cs.melt$enc.A)
    cs.melt$pat.B <- extract.pat(cs.melt$enc.B)
    
    same.pat <-      cs.melt[cs.melt$pat.A == cs.melt$pat.B ,]
    different.pat <- cs.melt[cs.melt$pat.A != cs.melt$pat.B ,]
    
    most.dissimilar <-
      different.pat[which.min(different.pat$similarity),]
    
    dissimilar.pat.frame <- rbind(psm_numerics[rownames(psm_numerics) ==
                                                 as.character(most.dissimilar$enc.A) ,],
                                  psm_numerics[rownames(psm_numerics) ==
                                                 as.character(most.dissimilar$enc.B) ,])
    
    print(t(dissimilar.pat.frame))
    

    which gives

                      patid.68.49   patid.9
    gender                1.00000   2.00000
    age                  41.85000  41.79000
    sysbp_min            72.00000 106.00000
    sysbp_max            95.00000 217.00000
    diasbp_min           42.00000  53.00000
    diasbp_max           61.00000 107.00000
    meanbp_min           52.00000  67.00000
    meanbp_max           72.00000 132.00000
    resprate_min         20.00000  14.00000
    resprate_max         35.00000  19.00000
    tempc_min            36.00000  35.50000
    tempc_max            37.55555  37.88889
    spo2_min             90.00000  95.00000
    spo2_max            100.00000 100.00000
    bicarbonate_min      22.00000  26.00000
    bicarbonate_max      22.00000  30.00000
    creatinine_min        2.50000   1.20000
    creatinine_max        2.50000   1.40000
    glucose_min          82.00000 129.00000
    glucose_max          82.00000 178.00000
    hematocrit_min       28.10000  37.40000
    hematocrit_max       28.10000  45.20000
    potassium_min         5.50000   2.80000
    potassium_max         5.50000   3.00000
    sodium_min          138.00000 136.00000
    sodium_max          138.00000 140.00000
    bun_min              28.00000  16.00000
    bun_max              28.00000  17.00000
    wbc_min               2.50000   7.50000
    wbc_max               2.50000  13.70000
    mingcs               15.00000  15.00000
    gcsmotor              6.00000   5.00000
    gcsverbal             5.00000   0.00000
    gcseyes               4.00000   1.00000
    endotrachflag         0.00000   1.00000
    urineoutput        1674.00000 887.00000
    vasopressor           0.00000   0.00000
    vent                  0.00000   1.00000
    los_hospital         19.09310   4.88130
    los_icu               3.53680   5.32310
    sofa                  3.00000   5.00000
    saps                 17.00000  18.00000
    posthospmort30day     1.00000   0.00000