Let's say I have a data frame df
that looks like this:
df = data.frame(c("A", "A", "B", "B", "C", "D", "D", "D", "E"),
c(0.1, 0.3, 0.1, 0.8, 0.4, 0.7, 0.5, 0.2, 0.1),
c("v1", "v2", "v1", "v3", "v4", "v2", "v3", "v4", "v2"))
colnames(df) = c("entry", "value", "point")
df = df[order(df$entry, -df$value),]
df
entry value point
2 A 0.3 v2
1 A 0.1 v1
4 B 0.8 v3
3 B 0.1 v1
5 C 0.4 v4
6 D 0.7 v2
7 D 0.5 v3
8 D 0.2 v4
9 E 0.1 v2
I would like to convert it eventually into a matrix of "ranked lists", that has as rows the unique values in the entry
column and the number of columns should be equal to the maximum number of unique elements in the point
column for a given entry
. In this example it would be 3. Each row should be populated with the corresponding values from the point
column, sorted descendingly based on the corresponding elements in value
(e.g., row A
should have v2
as value in the first column). In case an entry
has less points
than the number of columns in the matrix, the rest of the row should be filled with NAs
.
So, the expected output should look something like this:
>df
1 2 3
A v2 v1 NA
B v3 v1 NA
C v4 NA NA
D v2 v3 v4
E v2 NA NA
So far I have tried to create some sort of contingency table using
with(df, table(df$point, df$entry))
but of course my actual data is in the order of millions of entries, and the above command raises to huge amounts of RAM even when subsetting to 100 entries
with a couple hundreds of unique points
. I have also tried
xtabs(~ entry + point, data=df)
with the same results on my real data. Next I have tried to split it into ordered lists using
df = split(df$point, df$entry)
which works fine and it is fast enough, buuuuut.. now I have problems converting it to the result matrix. Something along those lines probably
matrix(sapply(df, function(x) unlist(x)), nrow=length(df), ncol=max(sapply(df, length)))
or first initialize a matrix and do some rbind
or something?
res = matrix(NA, nrow=length(df), ncol=max(sapply(df, length)))
rownames(res) = names(df)
....
Can you please assist?
With dplyr
:
df %>%
group_by(entry) %>%
mutate(unq=rank(rev(value))) %>%
select(-value) %>%
tidyr::spread(unq,point)
# A tibble: 5 x 4
# Groups: entry [5]
entry `1` `2` `3`
<fct> <fct> <fct> <fct>
1 A v2 v1 NA
2 B v3 v1 NA
3 C v4 NA NA
4 D v2 v3 v4
5 E v2 NA NA