R cast can't deal with unique rows


Question


I have a set of cluster IDs (cluster.id) and, for each cluster, the letters found in it (a simplification of my real data).

I'm interested in which letters are generally found together across the different clusters (I used the code from this answer). However, I'm not interested in the proportions in which each letter is found, so I wanted to remove duplicated rows (see code below).

This seems to work (no error), but the cast matrix gets filled with NA and strings instead of the desired counts (I explain everything further in the code comments below).

Any suggestions on how to fix this problem, or is this just something that isn't possible after filtering for unique rows?

Code


library(dplyr)     # %>%, group_by(), mutate(), filter(), select()
library(reshape2)  # acast()

test.set <- read.table(text = "
                            cluster.id   letters
                       1          4       A
                       2          4       B
                       3          4       B
                       4          3       A
                       5          3       E
                       6          3       D
                       7          3       C
                       8          2       A
                       9          2       E
                       10          1       A", header = TRUE, stringsAsFactors = FALSE)



# remove irrelevant clusters (clusters which only contain 1 letter)
test.set <- test.set %>% group_by( cluster.id ) %>%
  mutate(n.letters = n_distinct(letters)) %>%
  filter(n.letters > 1) %>%
  ungroup() %>%
  select( -n.letters)

test.set
#  cluster.id letters
#<int>   <chr>
#1          4       A
#2          4       B
#3          4       B
#4          3       A
#5          3       E
#6          3       D
#7          3       C
#8          2       A
#9          2       E



# I don't want duplicated rows because they are misleading.
# I'm only interested in which letters are found together in a
# cluster, not in what proportions.
# Therefore I want to remove these duplicated rows.

test.set.unique <- test.set %>% unique()
matrix <- acast(test.set.unique, cluster.id ~ letters)

matrix
#  A   B   C   D   E  
#2 "A" NA  NA  NA  "E"
#3 "A" NA  "C" "D" "E"
#4 "A" "B" NA  NA  NA 


# This matrix contains NA values and letters instead of the counts I wanted.
# However, casting the data before filtering for unique rows works fine

matrix <- acast(test.set, cluster.id ~ letters)
matrix
#  A B C D E
#2 1 0 0 0 1
#3 1 0 1 1 1
#4 1 2 0 0 0

Solution

  • If we look at the console messages from the acast() call on the unfiltered data, there is a message above the output:

    Aggregation function missing: defaulting to length

    To get similar output from the de-duplicated data, specify fun.aggregate explicitly:

    acast(test.set.unique, cluster.id ~ letters, length)
    #  A B C D E
    #2 1 0 0 0 1
    #3 1 0 1 1 1
    #4 1 1 0 0 0
    

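    When there are duplicate elements, fun.aggregate defaults to length, so the cells contain counts. When every cluster.id/letters combination is unique and fun.aggregate is not specified, no aggregation is needed, so acast guesses a value.var column (here letters) and fills the cells with that column's values, leaving NA for missing combinations. That produces the output shown in the question.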
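
For reference, here is a minimal self-contained sketch of the fix. It assumes the reshape2 package is installed, rebuilds the already filtered data from the question with data.frame() instead of read.table(), and passes value.var explicitly so that acast() does not have to guess it. The name presence is only illustrative.

```r
library(reshape2)

# Data from the question after the single-letter cluster has been dropped
test.set <- data.frame(
  cluster.id = c(4, 4, 4, 3, 3, 3, 3, 2, 2),
  letters    = c("A", "B", "B", "A", "E", "D", "C", "A", "E"),
  stringsAsFactors = FALSE
)

# Drop duplicated (cluster.id, letters) pairs
test.set.unique <- unique(test.set)

# Count rows per cell; on unique rows this gives a 0/1 presence matrix
presence <- acast(test.set.unique, cluster.id ~ letters,
                  value.var = "letters", fun.aggregate = length)
presence
#   A B C D E
# 2 1 0 0 0 1
# 3 1 0 1 1 1
# 4 1 1 0 0 0
```

Naming fun.aggregate and value.var explicitly means the result no longer depends on whether the input happens to contain duplicated rows.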