As it can be seen in the picture, I have a column with order numbers and a column with material numbers.
I want to find how often a pair of materials occur in the same order.
The problem is that I have 30000 rows of order numbers and 700 unique material numbers. Is it even possible?
I was thinking if it was easier to make a matrix with the 700 material numbers both in rows and column, and count number occurrences.
EDIT: The first picture was not a good example. I uploaded this second picture with random material numbers. So I want it to count for each pair (example 10-11, as I highlighted), how many times the appear in the same order. As it can be seen, 10&11 appear in 3 different orders.
The optimal solution in terms of memory space would be one row for each pair which would be 700*699 / 2. This problem is still relatively small and the simplicity of manipulating a 700*700 matrix is probably more valuable than the 700*701/2 cells you're saving, which would work out to 240kB with one byte per cell. It could be even less if the matrix is sparse (i.e. most pairs of materials are never ordered together) and you use an appropriate data structure.
Here's how the code would look like:
First we want to create a dataframe with as many rows and columns as there are materials. Matrices are easier to create so we create one that we convert to a dataframe afterwards.
all_materials = levels(as.factor(X$Materials))
number_materials = length(all_materials)
Pairs <- as.data.frame(matrix(data = 0, nrow = number_materials, ncol = number_materials))
(Here, X is your dataset)
We then set the row names and column names to be able to access the rows and columns directly with the identifiers of the materials which are apparently not necessarily numbered from 1 to 700.
colnames(Pairs) <- all_materials
rownames(Pairs) <- all_materials
Then we iterate over the dataset
for(order in levels(as.factor(X$Order.number))){
# getting the materials in each order
materials_for_order = X[X$Order.number==order, "Materials"]
if (length(materials_for_order)>1) {
# finding each possible pair from the materials list
all_pairs_in_order = combn(x=materials_for_order, m=2)
# incrementing the cell at the line and column corresponding to each pair
for(i in 1:ncol(all_pairs_in_order)){
Pairs[all_pairs_in_order[1, i], all_pairs_in_order[2, i]] = Pairs[all_pairs_in_order[1, i], all_pairs_in_order[2, i]] + 1
}
}
}
At the end of the loop, the Pairs
table should contain everything you need.