So here is the full dataset in pastebin: https://pastebin.com/xpGMsSSf
Quick snap of the pastebin:
`"V1","V2","N"
16,17,0.065532029
16,30,0.070163826
17,30,0.053089888
29,30,0.068024596`
Pre-processing of the data: I started with a list of customer orders and items within each order. I calculated number of occurrences of each pair of items within the same order. Then, I used the "Jackkard Index" to calculate similarities between the items. Now I am where you can see in the dataset.
The dataset: The dataset contains the material numbers in V1 and V2. N = similarity index between the items. The dataset only contains the pair of items that occurred together in the same order. Therefore, the are a lot of pairs that are not in the dataset.
My objective: I have 696 unique item numbers ranging from 1-696. I want a 696x696 matrix by N as values. The values for "missing pairs" in the dataset should be equal to zero = meaning that there is no similarity between the two items.
What am I going to use this for? I want to cluster the 696 items based on their number of occurrences within the same order.
xtabs
can be used to get the data in the form you want - it also has the nice feature that you can specify the outcome as a sparse matrix (which yours is (nrow(dat)/696^2
)
dat <- read.csv("https://pastebin.com/raw/xpGMsSSf")
# setting to factor introduces factor levels that are not found in the data
# see below for what is being done
dat[c("V1", "V2")] <- lapply(dat[c("V1", "V2")], factor, levels=1:696)
out <- xtabs( N ~ V1 + V2, dat, sparse=TRUE)
out[1:5, 1:5]
# To make symmetric
library(Matrix)
out[lower.tri(out)] <- t(out)[lower.tri(out)]
# Explanation of setting common factor levels
# example
x = c(1,2,3)
y = c(1,4,5)
table(x, y)
# but if we want both row and columns of table to include 1 to 5
# we can set to factor
x = factor(x, levels=1:5)
y = factor(y, levels=1:5)
table(x, y)
dput(head(mat))
structure(list(V1 = c(16L, 16L, 17L, 29L, 16L, 17L), V2 = c(17L,
30L, 30L, 30L, 29L, 29L), N = c(0.065532029, 0.070163826, 0.053089888,
0.068024596, 0.053083392, 0.041870099)), .Names = c("V1", "V2",
"N"), row.names = c(NA, 6L), class = "data.frame")