Search code examples
rcluster-analysisdistancecorrelationsimilarity

How can I convert a correlation or similarity table to 696x696 matrix


So here is the full dataset in pastebin: https://pastebin.com/xpGMsSSf

Quick snap of the pastebin:

`"V1","V2","N"
16,17,0.065532029
16,30,0.070163826
17,30,0.053089888
29,30,0.068024596`

Pre-processing of the data: I started with a list of customer orders and items within each order. I calculated number of occurrences of each pair of items within the same order. Then, I used the "Jackkard Index" to calculate similarities between the items. Now I am where you can see in the dataset.

The dataset: The dataset contains the material numbers in V1 and V2. N = similarity index between the items. The dataset only contains the pair of items that occurred together in the same order. Therefore, the are a lot of pairs that are not in the dataset.

My objective: I have 696 unique item numbers ranging from 1-696. I want a 696x696 matrix by N as values. The values for "missing pairs" in the dataset should be equal to zero = meaning that there is no similarity between the two items.

What am I going to use this for? I want to cluster the 696 items based on their number of occurrences within the same order.


Solution

  • xtabs can be used to get the data in the form you want - it also has the nice feature that you can specify the outcome as a sparse matrix (which yours is (nrow(dat)/696^2)

    dat <- read.csv("https://pastebin.com/raw/xpGMsSSf")
    
    # setting to factor introduces factor levels that are not found in the data
    # see below for what is being done
    dat[c("V1", "V2")] <- lapply(dat[c("V1", "V2")], factor, levels=1:696)
    
    out <- xtabs( N ~ V1 + V2, dat, sparse=TRUE)
    
    out[1:5, 1:5]
    
    # To make symmetric
    library(Matrix)
    out[lower.tri(out)] <- t(out)[lower.tri(out)]
    

    # Explanation of setting common factor levels
    # example
    x = c(1,2,3)
    y = c(1,4,5)
    table(x, y)
    # but if we want both row and columns of table to include 1 to 5
    # we can set to factor
    x = factor(x, levels=1:5)
    y = factor(y, levels=1:5)
    table(x, y)
    

    dput(head(mat))
    structure(list(V1 = c(16L, 16L, 17L, 29L, 16L, 17L), V2 = c(17L, 
    30L, 30L, 30L, 29L, 29L), N = c(0.065532029, 0.070163826, 0.053089888, 
    0.068024596, 0.053083392, 0.041870099)), .Names = c("V1", "V2", 
    "N"), row.names = c(NA, 6L), class = "data.frame")