r grouping data-representation

Pie chart of co-presence in clusters for about 10 factors in R


I've got a two-column dataset with about 30000 clusters and 10 factors like this:

cluster-1 Factor1
cluster-1 Factor2
...
cluster-2 Factor2
cluster-2 Factor3
...

And I would like to represent the co-occurrence of factors in the cluster set. Something like "Factor1+Factor3+Factor5 in 1234 clusters", and so on for the different combinations. I thought I could do something like a pie chart, but with 10 factors, I take it there could be too many combinations.

What would be a good way of representing this?


Solution

  • There is one good programming question in here that should be addressed:

    How do I count the number of co-occurrences of factors in the different clusters?

    First simulate some data:

    n = 1000
    
    set.seed(12345)
    n.clusters = 100
    clusters = rep(1:n.clusters, length.out=n)
    
    n.factors = 10
    factors = round(rnorm(n, n.factors/2, n.factors/5))
    factors[factors > n.factors] = n.factors
    factors[factors < 1] = 1
    
    data = data.frame(cluster=clusters, factor=factors)
    
    > data
      cluster factor
    1       1      6
    2       2      6
    3       3      5
    4       4      4
    5       5      6
    6       6      1
    ...
    

    Then here is the code that could be used to tabulate the number of times each combination of factors occurs in the clusters:

    counts = with(data, table(tapply(factor, cluster, function(x) paste(as.character(sort(unique(x))), collapse=''))))
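
    To make the one-liner above easier to follow, here is the same idea unpacked on a tiny data set. This is only a sketch: the data frame `toy` and its values are made up for illustration, and the `+` separator is an assumption added here so that labels stay readable once a factor number has two digits.

    ```r
    # Hypothetical toy data: 4 clusters, one (cluster, factor) pair per row
    toy = data.frame(cluster = c(1, 1, 2, 2, 2, 3, 3, 4),
                     factor  = c(1, 3, 1, 3, 3, 2, 10, 2))

    # Step 1: for each cluster, build one label from its distinct factors, sorted
    combo = with(toy, tapply(factor, cluster,
                             function(x) paste(sort(unique(x)), collapse = "+")))

    # Step 2: count how many clusters share each label
    toy.counts = table(combo)
    print(toy.counts)
    ```

    Each distinct label in `toy.counts` is one combination of factors, and its count is the number of clusters showing exactly that combination.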
    

    This can be represented as a simple pie chart, for example,

    dev.new(width=5, height=5)
    pie(counts[counts>1])
    


    but simple counts like this are often displayed most effectively as a sorted table. For more on this, see Edward Tufte's work on presenting quantitative data.
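
    A sorted table of the counts could be produced along these lines. This is a sketch under assumptions: `example.counts` is a small hand-made table with made-up numbers so that the block runs on its own, and the column names `combination` and `clusters` are illustrative choices, not anything from the original data.

    ```r
    # Hypothetical counts of clusters per factor combination (made-up numbers)
    example.counts = as.table(c("1+3" = 12, "2" = 40, "2+10" = 5, "4+5+6" = 27))

    # Sort from the most frequent combination to the least frequent
    sorted = sort(example.counts, decreasing = TRUE)

    # One row per combination, ready to print or export
    tab = data.frame(combination = names(sorted),
                     clusters    = as.vector(sorted))
    print(tab)
    ```

    With real data, the same two steps should apply to the `counts` table directly, and the most common combinations end up at the top where they are easiest to read.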