Search code examples
rdataframematrixranking

Convert data frame into a matrix of "ranked lists" based on unique values in column


Let's say I have a data frame df that looks like this:

df = data.frame(c("A", "A", "B", "B", "C", "D", "D", "D", "E"), 
        c(0.1, 0.3, 0.1, 0.8, 0.4, 0.7, 0.5, 0.2, 0.1),
        c("v1", "v2", "v1", "v3", "v4", "v2", "v3", "v4", "v2"))

colnames(df) = c("entry", "value", "point")
df = df[order(df$entry, -df$value),]

df
   entry value point
2     A   0.3    v2
1     A   0.1    v1
4     B   0.8    v3
3     B   0.1    v1
5     C   0.4    v4
6     D   0.7    v2
7     D   0.5    v3
8     D   0.2    v4
9     E   0.1    v2

I would like to convert it eventually into a matrix of "ranked lists", that has as rows the unique values in the entry column and the number of columns should be equal to the maximum number of unique elements in the point column for a given entry. In this example it would be 3. Each row should be populated with the corresponding values from the point column, sorted descendingly based on the corresponding elements in value (e.g., row A should have v2 as value in the first column). In case an entry has less points than the number of columns in the matrix, the rest of the row should be filled with NAs.

So, the expected output should look something like this:

>df
   1   2   3  
A  v2  v1  NA 
B  v3  v1  NA 
C  v4  NA  NA 
D  v2  v3  v4
E  v2  NA  NA

So far I have tried to create some sort of contingency table using

with(df, table(df$point, df$entry))

but of course my actual data is in the order of millions of entries, and the above command raises to huge amounts of RAM even when subsetting to 100 entries with a couple hundreds of unique points. I have also tried

xtabs(~ entry + point, data=df)

with the same results on my real data. Next I have tried to split it into ordered lists using

df = split(df$point, df$entry)

which works fine and it is fast enough, buuuuut.. now I have problems converting it to the result matrix. Something along those lines probably

matrix(sapply(df, function(x) unlist(x)), nrow=length(df), ncol=max(sapply(df, length)))

or first initialize a matrix and do some rbind or something?

res = matrix(NA, nrow=length(df), ncol=max(sapply(df, length)))
rownames(res) = names(df)
....

Can you please assist?


Solution

  • With dplyr:

    df %>% 
       group_by(entry) %>% 
       mutate(unq=rank(rev(value))) %>% 
       select(-value) %>% 
       tidyr::spread(unq,point)
    # A tibble: 5 x 4
    # Groups:   entry [5]
      entry `1`   `2`   `3`  
      <fct> <fct> <fct> <fct>
    1 A     v2    v1    NA   
    2 B     v3    v1    NA   
    3 C     v4    NA    NA   
    4 D     v2    v3    v4   
    5 E     v2    NA    NA