r algorithm graph-algorithm categorical-data dummy-variable

Combine maximum possible number of not mutually exclusive dummy variables into a categorical variable

This is an R specific version of a more abstract question asked here https://math.stackexchange.com/questions/2691617/

Let's say we have a lot of dummy variables that we want to use as controls in our regression. If these variables were mutually exclusive (i.e. only one of them can be equal to 1 in any specific observation), we would combine them into one categorical variable and use lfe package to "transform away" the resulting variable together with any other factor variables we have. In this way, we would avoid any snag associated with memory and performance constraints of doing regression on a very "wide" matrix.

However, if our dummies are not mutually exclusive, it is not possible to combine them all into one variable. For example, if we have people-month panel data with dummies indicating which medical condition a person had during any specific month, then if at least one person had high blood pressure and toothache together during the same month - these two conditions cannot be coded together into the same categorical variable.(To be clear: it is obviously possible to uniquely encode all combinations of conditions with one categorical variable, but that is not what is required).

We can still encode into a categorical variable at least some dummies that happen to be mutually exclusive in our specific data. For example, if high blood pressure and headache never co-occur in our hypothetical dataset, nothing prevents us from combining them into one categorical variable, even if such conditions are not mutually exclusive from a substantive point of view. Then we can "transform away" the categorical variable with lfe and leave the remaining dummies in the conventional part of estimation.

Question: What is an efficient way in R to encode the maximum possible number of mutually exclusive dummies from a larger set of not mutually exclusive dummies into a categorical variable?

I give an example of an algorithm that will encode at least some dummies (if at all possible) in the linked math.stackexchange question, but it neither is efficient nor guarantees the encoding of maximum possible number of dummies.

We can also think about observations and dummies as two types of nodes in a bimodal network. In this light, the question can be asked as: How do we find in R the largest set of nodes (dummies) unconnected with each other in a flattened network of dummies via having 1 in the same observation?

Solution

This problem is NP-complete and therefore has no general efficient solution. Take your formulation on the math network, it is easy to transform that into the maximal clique problem.

However for optimization purposes you do not need the very best solution, merely a good one. So a simple greedy approach is to recursively pick the variable that intersects the fewest other ones. That will produce a set of booleans that can be collapsed. Now repeat to collapse another set. Keep on going until every remaining boolean intersects some other one.