I have put together a nested for loop that creates the object I am after. It works fine on a small toy data set, but my real data will be much larger, so I am trying to find out whether an R package has a built-in function for this task.
The final object is a data frame or matrix showing the conditional percentage for each column given the reference row. Here is the code for a toy data set and the nested for loop that generates the final output.
mylist <- list(
ID001=c("apple","orange","grape"),
ID002=c("banana","grape"),
ID003=c("apple","pineapple"),
ID004=c("orange","apple"),
ID005=c("orange","grape", "apple"))
dat <- reshape2::melt(mylist)
names(dat) <- c("fruit","id")
dat <- dat[,c(2,1)]
theFruit <- unique(dat$fruit)
n <- length(theFruit)
final.df <- data.frame(matrix(nrow=n,ncol=n, dimnames=list(theFruit,theFruit)))
for (i in theFruit) {
  for (j in theFruit) {
    tempid1 <- dat[dat$fruit == i, ]$id
    tempid2 <- dat[dat$fruit == j, ]$id
    # share of the IDs with fruit i that also have fruit j
    final.df[i, j] <- round(length(which(tempid1 %in% tempid2)) / length(tempid1), 2)
  }
}
final.df
          apple orange grape banana pineapple
apple      1.00   0.75  0.50   0.00      0.25
orange     1.00   1.00  0.67   0.00      0.00
grape      0.67   0.67  1.00   0.33      0.00
banana     0.00   0.00  1.00   1.00      0.00
pineapple  1.00   0.00  0.00   0.00      1.00
Reading the output, we see that, given a person ate an apple (apple row), 75% also ate an orange (orange column). Similarly, given a person ate an orange (orange row), 100% also ate an apple (apple column). The matrix is not meant to be symmetric, as a table of plain intersections of the two fruits would be; each entry is the column conditioned on the row.
This seems akin to a market basket analysis, and I have been working with the arules package for the past few days to get at this. In the vernacular of the arules package, I would say the percentages populating the data frame are support values, but I have not been able to generate a matrix or data frame of all of the support percentages from arules.
The data I will be working with will have a couple million IDs but only about 150 "products", so the output matrix would only be about 150x150. I can use arules to identify the compelling pairwise relationships, but there is interest in seeing ALL of the conditionals.
Does anyone know if arules or another package can accomplish this?
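You are looking for confidence values (Wikipedia) rather than support: the confidence of a rule {A} => {B} is the share of transactions containing A that also contain B. You can get output similar to yours with arules: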
library(arules)
library(reshape2)
trans <- as(mylist, "transactions")
rules <- apriori(trans, parameter = list(supp = 0, conf = 0, minlen=2, maxlen=2))
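# Relies on inspect() returning a data frame; if your arules version returns nothing here,
# DATAFRAME(rules) gives similar columns (named LHS and RHS instead of lhs and rhs).
df <- inspect(rules)[, c("lhs", "rhs", "confidence")]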
dcast(df, lhs~rhs, value.var="confidence", fill=1)
#           lhs   {apple}  {banana}   {grape}  {orange} {pineapple}
# 1     {apple} 1.0000000 0.0000000 0.5000000 0.7500000        0.25
# 2    {banana} 0.0000000 1.0000000 1.0000000 0.0000000        0.00
# 3     {grape} 0.6666667 0.3333333 1.0000000 0.6666667        0.00
# 4    {orange} 1.0000000 0.0000000 0.6666667 1.0000000        0.00
# 5 {pineapple} 1.0000000 0.0000000 0.0000000 0.0000000        1.00
Of course, you can turn the first column into row names and convert the data frame to a matrix afterwards. I leave that up to you.
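If you only need the pairwise conditionals, a base R alternative that skips rule mining may also be worth trying. This is a minimal sketch rather than anything arules does internally: it builds a 0/1 incidence matrix (IDs in rows, fruits in columns), takes its cross-product to count co-occurrences, and divides each row by the diagonal. The names long, inc, counts and cond are made up for this illustration.

```r
# Toy data from the question
mylist <- list(
  ID001 = c("apple", "orange", "grape"),
  ID002 = c("banana", "grape"),
  ID003 = c("apple", "pineapple"),
  ID004 = c("orange", "apple"),
  ID005 = c("orange", "grape", "apple"))

# Long format: one row per (fruit, id) pair; columns are "values" and "ind"
long <- stack(mylist)

# 0/1 incidence matrix: rows are IDs, columns are fruits
inc <- (unclass(table(long$ind, long$values)) > 0) * 1

# counts[i, j] = number of IDs that have both fruit i and fruit j;
# the diagonal holds the number of IDs per fruit
counts <- crossprod(inc)

# Divide row i by the number of IDs with fruit i (the vector is recycled down the columns),
# giving the share of IDs with fruit i that also have fruit j
cond <- counts / diag(counts)

round(cond, 2)
```

On the toy data this should reproduce the matrix from the question, with the fruits in alphabetical order. With a couple million IDs the dense incidence matrix could get large, so a sparse matrix (for example from the Matrix package) may be the better choice there; I have not tested that at scale.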