r data.table conditional-statements updating

R data.table - update by summing over subsets coded by columns

I have a collection of sets encoded in a data.table sets, where id.s is the id of a set and id.e is one of its elements. Each set s has a value v(s). The values of the function v() are stored in another data.table v, where each row contains a set id id.s and its value.

sets <- data.table(
    id.s = c(1,2,2,3,3,3,4,4,4,4),
    id.e = c(3,3,4,2,3,4,1,2,3,4))

v <- data.table(id.s = 1:4, value = c(1/10,2/10,3/10,4/10))

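I need to calculate a new function v'() such that

$$ v'(a) = \sum_{b \supseteq a} (-1)^{|b \setminus a|} \, v(b) $$

where the sum runs over all sets b in sets that contain a, |s| denotes the cardinality of the set s (the number of elements), and b \ a denotes set difference (the elements of b that are not in a). Since a is contained in b, |b \ a| = |b| - |a|.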
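To make the definition concrete, here is a minimal base-R sketch that evaluates it literally for the four example sets: for each set a it adds (-1)^|b \ a| * v(b) over every listed set b that contains a, which is the same quantity the loop under "My current code" computes. The names s, v_val and v_prime are made up for this illustration, and the sketch is meant as a readable reference for checking faster versions, not as a fast implementation.

```r
# Example sets as a named list: set id -> its elements (hypothetical name `s`)
s <- list(`1` = 3, `2` = c(3, 4), `3` = c(2, 3, 4), `4` = c(1, 2, 3, 4))
# v(s) for each set id (hypothetical name `v_val`)
v_val <- c(`1` = 0.1, `2` = 0.2, `3` = 0.3, `4` = 0.4)

# v'(a): sum of (-1)^|b \ a| * v(b) over every listed set b that contains a
v_prime <- function(a_id) {
  a <- s[[a_id]]
  total <- 0
  for (b_id in names(s)) {
    b <- s[[b_id]]
    if (all(a %in% b)) {  # b is a superset of a (a itself included)
      total <- total + (-1)^length(setdiff(b, a)) * v_val[[b_id]]
    }
  }
  total
}

result <- vapply(names(s), v_prime, numeric(1))
print(round(result, 10))
```

On the example data this should agree with the value2 column shown further down (-0.2, 0.3, -0.1, 0.4), up to floating-point rounding.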

Right now I do this with a for-loop that updates the table row by row, but it takes too much time for large data.tables with thousands of sets and thousands of elements.

Do you have any idea how to make it faster?

My current code:

# convert data.table to wide format 
dc <- dcast(sets, id.s ~ id.e, drop = FALSE, value.var = "id.e" , fill = 0)
# take columns corresponding to elements id.e
cols <- names(dc)[-1]
# convert columns cols to 0-1 coding
dc[, (cols) := lapply(.SD, function(x) ifelse(x > 0,1,0)), .SDcols = cols]

# join dc with v
dc <- dc[v, on = "id.s"]

# calculate the cardinality of each set
dc[, cardinality := sum(.SD > 0), .SDcols = cols, by = id.s]

# prepare column for new value
dc[, value2 := 0]

#   id.s 1 2 3 4 value cardinality value2
#1:    1 0 0 1 0   0.1           1      0
#2:    2 0 0 1 1   0.2           2      0
#3:    3 0 1 1 1   0.3           3      0
#4:    4 1 1 1 1   0.4           4      0

# for each set (row of dc)
for(i in seq_len(nrow(dc))) {
  row <- dc[i,]
  set <- as.numeric(row[, cols, with = FALSE])
  row.cardinality <- as.numeric(row$cardinality)
  # find its supersets: rows that contain every element of the current set
  dc[, is.superset := ifelse(rowSums(mapply("*", dc[, cols, with = FALSE], set)) == row.cardinality, 1, 0)]
  # use the formula to update the value
  res <- dc[is.superset == 1, sum((-1)^(cardinality - row.cardinality) * value)]
  dc[i, value2 := res]
}

dc[, .(id.s, value2)]
#   id.s value2
#1:    1   -0.2
#2:    2    0.3
#3:    3   -0.1
#4:    4    0.4

Solution

This might work for you:

First, make a small helper function that returns the supersets of a given set (the set itself is listed first)

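# el: elements of one set; setvalue: the id.s of that set
# Returns the set's own id followed by the ids of all other sets that contain every element of el
get_superset <- function(el, setvalue) {
  c(setvalue, sets[id.s != setvalue, setequal(intersect(el, id.e), el), by = id.s][V1 == TRUE, id.s])
}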
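As a quick check of what the helper returns, the sketch below calls it directly on the example data from the question. The variable names sup1 and sup3 are only for illustration. Note that the function reads the global sets object, so sets must exist before it is called.

```r
library(data.table)

sets <- data.table(
    id.s = c(1,2,2,3,3,3,4,4,4,4),
    id.e = c(3,3,4,2,3,4,1,2,3,4))

get_superset <- function(el, setvalue) {
  c(setvalue, sets[id.s != setvalue, setequal(intersect(el, id.e), el), by = id.s][V1 == TRUE, id.s])
}

# Set 1 is {3}: every other set also contains element 3
sup1 <- get_superset(3, 1)
# Set 3 is {2, 3, 4}: only set 4 contains all three elements
sup3 <- get_superset(c(2, 3, 4), 3)

print(sup1)
print(sup3)
```

Here sup1 should hold the ids 1, 2, 3, 4 and sup3 the ids 3, 4.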

1. Get the cardinality of each set in the sets object, and also save it separately for later use (see step 4)

sets[, cardinality:=.N, by=.(id.s)]
cardinality = unique(sets[, .(id.s, cardinality)])

2. Add the supersets, by set, using the above function

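# .BY$id.s is the current set's id (.GRP would only match it when the ids are 1, 2, ... in order of appearance)
sets <- unique(sets[,!c("id.e")][sets[, .("supersets"=get_superset(id.e, .BY$id.s)),by=id.s], on=.(id.s)])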

(Note: As an alternative, step 2 could be broken into three sub-steps, like this)

# 2a. Get the supersets
supersets = sets[, .("supersets"=get_superset(id.e, .BY$id.s)),by=id.s]
# 2b. Merge the supersets on the original sets 
sets = sets[supersets, on=.(id.s)]
# 2c. Retain only necessary columns, and make unique
sets = unique(sets[, .(id.s, cardinality,supersets)])
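If calling a function once per set becomes slow, the superset pairs could also be found with a single self-join. This is a different technique from the helper function above: pair up sets that share an element, count the shared elements per pair, and keep the pairs where the count equals the cardinality of the first set. The sketch below assumes sets has no duplicated (id.s, id.e) rows and no empty sets. The names card, pairs, shared and sup_pairs are hypothetical.

```r
library(data.table)

sets <- data.table(
    id.s = c(1,2,2,3,3,3,4,4,4,4),
    id.e = c(3,3,4,2,3,4,1,2,3,4))

# Cardinality of each set
card <- sets[, .(cardinality = .N), by = id.s]

# One row per (candidate subset, candidate superset, shared element)
pairs <- merge(sets, sets, by = "id.e", suffixes = c(".sub", ".sup"),
               allow.cartesian = TRUE)

# Number of shared elements for every pair of sets
shared <- pairs[, .(n_common = .N), by = .(id.s.sub, id.s.sup)]

# Attach the cardinality of the candidate subset
shared <- shared[card, on = .(id.s.sub = id.s)]

# A pair is a (subset, superset) pair when all elements of the subset are shared
sup_pairs <- shared[n_common == cardinality,
                    .(id.s = id.s.sub, supersets = id.s.sup)]
setorder(sup_pairs, id.s, supersets)
print(sup_pairs)
```

The resulting table has the same two columns (id.s, supersets) as the output of sub-step 2a, so it could in principle be used in its place. The intermediate pairs table grows with the number of sets sharing each element, so memory use is worth checking on large inputs.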

3. Add the value of each superset

sets <- sets[v,on=.(supersets=id.s)][order(id.s)]

4. Grab the cardinality of each superset

sets <- sets[cardinality, on=.(supersets=id.s)]

5. Get the result (i.e. compute your v' function)

result = sets[, .(value2 = sum((-1)^(i.cardinality-cardinality)*value)), by=.(id.s)]

Output:

   id.s value2
1:    1   -0.2
2:    2    0.3
3:    3   -0.1
4:    4    0.4
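For comparison, here is a rough matrix-based sketch of the same calculation. It is an alternative technique, not the approach used in the steps above. It builds a 0/1 incidence matrix (sets by elements), uses tcrossprod() to get the size of every pairwise intersection, and treats b as a superset of a when that size equals |a|. The names M, card, common, is_sup, sgn and val are made up for this example. It assumes there are no duplicated (id.s, id.e) rows and that dense matrices of size sets x elements and sets x sets fit in memory.

```r
library(data.table)

sets <- data.table(
    id.s = c(1,2,2,3,3,3,4,4,4,4),
    id.e = c(3,3,4,2,3,4,1,2,3,4))
v <- data.table(id.s = 1:4, value = c(1/10,2/10,3/10,4/10))

# 0/1 incidence matrix: rows are sets, columns are elements
ids <- sort(unique(sets$id.s))
els <- sort(unique(sets$id.e))
M <- matrix(0, nrow = length(ids), ncol = length(els))
M[cbind(match(sets$id.s, ids), match(sets$id.e, els))] <- 1

card   <- rowSums(M)       # cardinality of every set
common <- tcrossprod(M)    # common[a, b] = number of elements shared by a and b
is_sup <- common == card   # TRUE where every element of a is in b (card recycles down rows)

# sgn[a, b] = (-1)^(|b| - |a|)
sgn <- outer(card, card, function(ca, cb) (-1)^(cb - ca))

# Values aligned with the row order of M
val <- v$value[match(ids, v$id.s)]

# For each set a: sum over its supersets b of sgn[a, b] * v(b)
result <- data.table(id.s = ids, value2 = as.vector((is_sup * sgn) %*% val))
print(result)
```

On the example data this should give the same value2 column as above, up to floating-point rounding. Whether it is faster than the join-based steps on large inputs would need to be measured.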