r list vector frequency frequency-analysis

Combining all elements in a vector of lists based on the common first element of each list in the vector in R

I have a fairly large vector of lists (about 300,000 rows). For example, let's consider the following:

vec = c( 
  list(c("A",10,11,12)), 
  list(c("B",10,11,15)),
  list(c("A",10,12,12,16)),
  list(c("A",11,12,16,17)) )

Now, I want to do the following:

For each unique first element, I need all the unique elements that occur in the lists starting with that element, along with their respective frequencies.

Output would be somewhat like:

For A, I would have elements 10, 11, 12, 16 & 17 with frequencies 2, 2, 4, 2 & 1 respectively. For B, it would be 10, 11, 15 with frequencies 1, 1, 1.

Many thanks in advance, Ankur.

Solution

Here's one way to do it.

First, a simpler way to create your list is:

L <- list(c("A", 10, 11, 12), 
          c("B", 10, 11, 15), 
          c("A", 10, 12, 12, 16), 
          c("A", 11, 12, 16, 17))

Now you can split by the first element of each vector, and then tabulate all but the first element.

tapply(L, sapply(L, '[[', 1), function(x) 
  table(unlist(lapply(x, function(x) x[-1]))))

## $A
## 
## 10 11 12 16 17 
##  2  2  4  2  1 
## 
## $B
## 
## 10 11 15 
##  1  1  1
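The same computation can be spelled out step by step, which may be easier to follow than the nested one-liner. This is only a sketch of an equivalent approach; the names keys, values and freqs are illustrative and not part of the original answer.

```r
L <- list(c("A", 10, 11, 12),
          c("B", 10, 11, 15),
          c("A", 10, 12, 12, 16),
          c("A", 11, 12, 16, 17))

# Grouping key: the first element of every vector
keys <- vapply(L, `[`, character(1), 1)

# Drop the key from each vector, leaving only the values
values <- lapply(L, `[`, -1)

# Pool the values of each group, then count them
freqs <- lapply(split(values, keys), function(g) table(unlist(g)))

freqs$A
```

Because c("A", 10, 11, 12) mixes a string with numbers, every element is stored as a character string, so the names of each table are strings such as "10" rather than numbers.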

Scaling up to a list comprising 300,000 elements of similar size:

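# simplify = FALSE guarantees a list, even if every vector happens to have the same length
L <- replicate(300000, c(sample(LETTERS, 1), sample(100, sample(3:4, 1))), 
               simplify = FALSE)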

system.time(
  freqs <- tapply(L, sapply(L, '[[', 1), function(x) 
    table(unlist(lapply(x, function(x) x[-1]))))
)

## user  system elapsed 
## 0.68    0.00    0.69

If you want to sort the vectors of the resulting list, as per the OP's comment below, you can just modify the function applied to the groups of L:

tapply(L, sapply(L, '[[', 1), function(x) 
  sort(table(unlist(lapply(x, function(x) x[-1]))), decreasing=TRUE))

## $A
## 
## 12 10 11 16 17 
##  4  2  2  2  1 
## 
## $B
## 
## 10 11 15 
##  1  1  1
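One thing to keep in mind when ordering results: the values are stored as character strings, so table arranges them alphabetically by default ("100" comes before "9"). If ordering by numeric value is wanted instead of ordering by frequency, something along these lines should work. It is a small sketch on made-up data, and tab and by_value are illustrative names.

```r
# Made-up values, stored as strings just like the elements of L
tab <- table(c("9", "100", "9", "25"))

# Default order is alphabetical: "100", "25", "9"
names(tab)

# Reorder the counts by the numeric value of their names
by_value <- tab[order(as.numeric(names(tab)))]
names(by_value)
```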

If you only want to tabulate the values for a particular group, e.g. group A (the vectors that begin with A), one option is to subset the above result:

L2 <- tapply(L, sapply(L, '[[', 1), function(x) 
  sort(table(unlist(lapply(x, function(x) x[-1]))), decreasing=TRUE), 
  simplify=FALSE)

L2$A

(Note that I've added simplify=FALSE so that this will work even if the number of unique elements is the same across groups.)

It's more efficient to only perform the operation for the group of interest, though, in which case maybe the following is better:

sort(table(unlist(
  lapply(split(L, sapply(L, '[[', 1))$A, function(x) x[-1])
)), decreasing=TRUE)

Here split first splits L into groups according to each vector's first element, and $A then keeps just group A.
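Note that split still builds every group before the others are thrown away. A possible variation, sketched here with the illustrative names keys and groupA, is to filter L down to the group of interest first and tabulate only that subset:

```r
L <- list(c("A", 10, 11, 12),
          c("B", 10, 11, 15),
          c("A", 10, 12, 12, 16),
          c("A", 11, 12, 16, 17))

# First element of every vector, used as the grouping key
keys <- vapply(L, `[`, character(1), 1)

# Keep only the vectors that begin with "A"
groupA <- L[keys == "A"]

# Drop the key, pool the remaining values, count and sort by frequency
resA <- sort(table(unlist(lapply(groupA, `[`, -1))), decreasing = TRUE)
resA
```

Whether this is noticeably faster than the split version will depend on the data, so it is worth timing both with system.time on the real list.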