Question
I have a cluster.id column and, for each cluster.id, the different letters found in that cluster (as a simplification).
I'm interested in which letters are generally found together across the different clusters (I used the code from this answer). However, I'm not interested in the proportions in which each letter is found, so I wanted to remove duplicated rows (see code below).
This seems to work (no error), but the cast matrix gets filled with NA and strings instead of the desired counts (I explain everything further in the code comments below).
Any suggestions on how to fix this problem, or is this just not possible after filtering for unique rows?
Code
library(dplyr)     # %>%, group_by, mutate, filter, select
library(reshape2)  # acast

test.set <- read.table(text = "
cluster.id letters
1 4 A
2 4 B
3 4 B
4 3 A
5 3 E
6 3 D
7 3 C
8 2 A
9 2 E
10 1 A", header = TRUE, stringsAsFactors = FALSE)
# remove irrelevant clusters (clusters which only contain 1 letter)
test.set <- test.set %>%
  group_by(cluster.id) %>%
  mutate(n.letters = n_distinct(letters)) %>%
  filter(n.letters > 1) %>%
  ungroup() %>%
  select(-n.letters)
test.set
# cluster.id letters
#<int> <chr>
#1 4 A
#2 4 B
#3 4 B
#4 3 A
#5 3 E
#6 3 D
#7 3 C
#8 2 A
#9 2 E
# I don't want duplicated rows because they are misleading.
# I'm only interested in which letters are found together in a
# cluster, not in what proportions.
# Therefore I want to remove these duplicated rows.
test.set.unique <- test.set %>% unique()
matrix <- acast(test.set.unique, cluster.id ~ letters)
matrix
# A B C D E
#2 "A" NA NA NA "E"
#3 "A" NA "C" "D" "E"
#4 "A" "B" NA NA NA
# This matrix contains NA values and letters instead of the counts I wanted.
# However, casting the data before filtering for unique rows works fine:
matrix <- acast(test.set, cluster.id ~ letters)
matrix
# A B C D E
#2 1 0 0 0 1
#3 1 0 1 1 1
#4 1 2 0 0 0
Answer
The messages explain the difference. For the call on the unfiltered data, acast prints this message above the output:
Aggregation function missing: defaulting to length
To get similar output from the de-duplicated data, specify fun.aggregate explicitly:
acast(test.set.unique, cluster.id ~ letters, fun.aggregate = length)
# A B C D E
#2 1 0 0 0 1
#3 1 0 1 1 1
#4 1 1 0 0 0
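When a cluster.id/letters combination occurs more than once, acast has to aggregate, and with no fun.aggregate given it defaults to length. When every combination is unique, no aggregation is needed, so without fun.aggregate it picks a value.var column (here letters) and fills the cells with the values of that column, leaving NA where a combination is absent. That is the output shown in the OP's post.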
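As a possible alternative that avoids the unique() step altogether, the sketch below builds the presence/absence matrix in base R and then derives letter co-occurrence counts from it. It assumes the filtered data from the question; the names counts, presence and cooccur are made up for illustration, and using crossprod for the co-occurrence step is an assumption about what the OP ultimately wants, not something stated in the post.

```r
# Rebuild the filtered data from the question (clusters 2, 3 and 4).
test.set <- data.frame(
  cluster.id = c(4, 4, 4, 3, 3, 3, 3, 2, 2),
  letters    = c("A", "B", "B", "A", "E", "D", "C", "A", "E"),
  stringsAsFactors = FALSE
)

# Count rows per cluster/letter pair, then cap every count at 1 so that
# duplicated rows (the two B's in cluster 4) no longer matter.
counts   <- unclass(table(test.set$cluster.id, test.set$letters))
presence <- 1L * (counts > 0)

# Co-occurrence: entry [i, j] is the number of clusters that contain
# both letter i and letter j (the diagonal counts clusters per letter).
cooccur <- crossprod(presence)

presence
cooccur
```

Here presence should match the matrix returned by acast with fun.aggregate = length on the de-duplicated data, and cooccur["A", "E"] gives the number of clusters in which A and E appear together.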