
R data.table - update by summing over subsets coded by columns


I have the following problem. I have a list of sets encoded in a data.table sets, where id.s is the id of a set and id.e is one of its elements (one row per set-element pair). Each set s has a value v(s). The values of the function v() are stored in another data.table v, where each row contains a set id id.s and that set's value.

library(data.table)

sets <- data.table(
    id.s = c(1,2,2,3,3,3,4,4,4,4),
    id.e = c(3,3,4,2,3,4,1,2,3,4))

v <- data.table(id.s = 1:4, value = c(1/10,2/10,3/10,4/10))

I need to calculate a new function v'() such that

$$ v'(a) = \sum_{b \supseteq a} (-1)^{|b \setminus a|} \, v(b) $$

where |s| denotes the cardinality of the set s (its number of elements) and b \ a denotes set difference (the set b with every element it shares with a removed). The sum runs over all sets b in the list that are supersets of a, so |b \ a| = |b| - |a|.
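As a small worked instance, read off the loop in the code further down (which sums (-1)^(cardinality difference) times value over all supersets): set 1 is {3}, and every set in the example contains element 3, so all four sets are its supersets. With the values 0.1, 0.2, 0.3, 0.4 from v, the alternating sum would be:

```latex
\begin{aligned}
v'(\{3\}) &= (-1)^{0}\,v(\{3\}) + (-1)^{1}\,v(\{3,4\}) + (-1)^{2}\,v(\{2,3,4\}) + (-1)^{3}\,v(\{1,2,3,4\}) \\
          &= 0.1 - 0.2 + 0.3 - 0.4 = -0.2
\end{aligned}
```

This agrees with the value2 reported for id.s = 1 in the expected output.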

Right now I do this with a for loop that updates the table row by row. However, it takes too much time for large data.tables with thousands of sets and thousands of elements.

Do you have any idea how to make it faster?

My current code:

# convert data.table to wide format 
dc <- dcast(sets, id.s ~ id.e, drop = FALSE, value.var = "id.e" , fill = 0)
# take columns corresponding to elements id.e
cols <- names(dc)[-1]
# convert columns cols to 0-1 coding
dc[, (cols) := lapply(.SD, function(x) ifelse(x > 0,1,0)), .SDcols = cols]

# join dc with v
dc <- dc[v, on = "id.s"]

# calculate the cardinality of each set
dc[, cardinality := sum(.SD > 0), .SDcols = cols, by = id.s]

# prepare column for new value
dc[, value2 := 0]

#   id.s 1 2 3 4 value cardinality value2
#1:    1 0 0 1 0   0.1           1      0
#2:    2 0 0 1 1   0.2           2      0
#3:    3 0 1 1 1   0.3           3      0
#4:    4 1 1 1 1   0.4           4      0

# for each set (row of dc)
for (i in seq_len(nrow(dc))) {
  row <- dc[i, ]
  # 0-1 membership vector of the current set
  set <- as.numeric(row[, cols, with = FALSE])
  row.cardinality <- as.numeric(row$cardinality)
  # find its supersets: rows that contain every element of the current set
  dc[, is.superset := ifelse(rowSums(mapply("*", dc[, cols, with = FALSE], set)) == row.cardinality, 1, 0)]
  # use the formula to update the value
  res <- dc[is.superset == 1, sum((-1)^(cardinality - row.cardinality) * value)]
  dc[i, value2 := res]
}

dc[, .(id.s, value2)]
#   id.s value2
#1:    1   -0.2
#2:    2    0.3
#3:    3   -0.1
#4:    4    0.4


Solution

  • This might work for you:

    Make a small helper function that returns the supersets of a given set (including the set itself):

    get_superset <- function(el, setvalue) {
      c(setvalue, sets[id.s!=setvalue, setequal(intersect(el, id.e), el), by=id.s][V1==TRUE, id.s])
    }
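As a quick check of what this helper returns, here is a minimal sketch that assumes the `sets` table from the question and calls the function by hand for set 2 (elements 3 and 4). The standalone call is only an illustration and is not part of the answer's steps.

```r
library(data.table)

sets <- data.table(
  id.s = c(1, 2, 2, 3, 3, 3, 4, 4, 4, 4),
  id.e = c(3, 3, 4, 2, 3, 4, 1, 2, 3, 4))

# helper copied from the answer: the set itself plus every other set
# whose elements contain all of `el`
get_superset <- function(el, setvalue) {
  c(setvalue, sets[id.s != setvalue, setequal(intersect(el, id.e), el), by = id.s][V1 == TRUE, id.s])
}

# supersets of set 2 = {3, 4}: itself, set 3 = {2, 3, 4} and set 4 = {1, 2, 3, 4}
supersets_of_2 <- get_superset(c(3, 4), 2)
print(supersets_of_2)
```

Set 1 = {3} is not returned because it lacks element 4.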
    
    1. Get cardinality of each set in the sets object, but also save separately for later use (see step 4)
    sets[, cardinality:=.N, by=.(id.s)]
    cardinality = unique(sets[, .(id.s, cardinality)])
    
    2. Add supersets, by set, using above function
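    # pass the actual set id via .BY; .GRP is only the group counter and equals id.s only when the ids are 1, 2, 3, ...
    sets <- unique(sets[,!c("id.e")][sets[, .("supersets"=get_superset(id.e, .BY$id.s)),by=id.s], on=.(id.s)])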
    

    (Note: As an alternative, step 2 could be broken into three sub-steps, like this)

    # 2a. Get the supersets (.BY$id.s is the id of the current group; .GRP would only be its position)
    supersets = sets[, .("supersets"=get_superset(id.e, .BY$id.s)),by=id.s]
    # 2b. Merge the supersets on the original sets 
    sets = sets[supersets, on=.(id.s)]
    # 2c. Retain only necessary columns, and make unique
    sets = unique(sets[, .(id.s, cardinality,supersets)])
    
    3. Add the value of each superset
    sets <- sets[v,on=.(supersets=id.s)][order(id.s)]
    
    4. Grab the cardinality of each superset
    sets <- sets[cardinality, on=.(supersets=id.s)]
    
    5. Get the result (i.e., compute your v' function)
    result = sets[, .(value2 = sum((-1)^(i.cardinality-cardinality)*value)), by=.(id.s)]
    

    Output:

       id.s value2
    1:    1   -0.2
    2:    2    0.3
    3:    3   -0.1
    4:    4    0.4
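A possible alternative, not taken from the answer above: the loop in the question can also be vectorized with a single matrix product. The sketch below assumes the example tables from the question and that a dense n-by-n matrix (n = number of sets) fits in memory. The names `M`, `card`, `common`, `is_superset`, `sgn` and `val` are illustrative only.

```r
library(data.table)

sets <- data.table(
  id.s = c(1, 2, 2, 3, 3, 3, 4, 4, 4, 4),
  id.e = c(3, 3, 4, 2, 3, 4, 1, 2, 3, 4))
v <- data.table(id.s = 1:4, value = c(1/10, 2/10, 3/10, 4/10))

ids <- sort(unique(sets$id.s))
els <- sort(unique(sets$id.e))

# 0-1 incidence matrix: one row per set, one column per element
M <- matrix(0, nrow = length(ids), ncol = length(els))
M[cbind(match(sets$id.s, ids), match(sets$id.e, els))] <- 1

card <- rowSums(M)        # |s| for every set
common <- tcrossprod(M)   # common[i, j] = number of elements shared by sets i and j

# set j is a superset of set i exactly when they share all |s_i| elements
# (card is recycled down the columns, so entry [i, j] is compared with card[i])
is_superset <- common == card

# sign (-1)^(|s_j| - |s_i|) for every pair (i, j)
sgn <- (-1)^outer(card, card, function(ci, cj) cj - ci)

# values of v aligned with the row order of M
val <- v$value[match(ids, v$id.s)]

# row i sums the signed values over all supersets of set i
result <- data.table(id.s = ids,
                     value2 = as.vector((is_superset * sgn) %*% val))
print(result)
```

On the example data this should reproduce the value2 column shown above. For very many sets, the n-by-n matrices may become the bottleneck, in which case a sparse representation (for instance via the Matrix package) might be worth trying.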