Search code examples
rdataframeranking

r data frame- convert values to text and rank from most to least occurrences per row


New to R and having an issue converting formulas from from excel to R. Any advice would be great. I have a data frame called allwins_df, where I want to change each value to the first letter of the column name.

Then I need to calculate the number of occurrences of each letter on each date and rank them from highest to lowest number of occurrences. Lastly, if there is a tie the tie-winner is the letter who's underlying number was the greatest. i.e on 1/5/01 one of the B values had a higher value than the A's so B is first, then A, then C. Sample of data frame:

Date        A.B  A.C  B.A   B.C  C.A  C.B…
2001-01-01  N/A  NA   0.14  0.35 0.43 NA
2001-01-02  0.8  NA   NA    0.5  NA   0.32
2001-01-03  0.75 0.8  NA    0.65 NA   0.9
2001-01-04  NA   0.66 0.91  NA   NA   NA
2001-01-05  0.52 0.62 0.48  0.82 0.4  NA

With some help I used the following code to turn the values into letters but not sure if I can use the resulting output to rank from most to least occurrences and handle ties.

 cols <- sub("\\..*", "",names(allwins_df)[-1])
    mat <- which(!is.na(allwins_df[-1]), arr.ind = TRUE)
    allwins_df[-1][mat]<- allwins_df[-1][mat] <- cols[mat[, 2]]

Output from code above -only changing values to letters:

Date        A.B A.C B.A B.C C.A C.B…
2001-01-01  N/A NA  B   B   C   NA
2001-01-02  A   A   NA  B   NA  C
2001-01-03  A   A   NA  B   NA  C
2001-01-04  NA  A   B   NA  NA  NA
2001-01-05  A   A   B   B   C   NA

Final Goal Output Example:

Date            
2001-01-01  B   C   
2001-01-02  A   B   C
2001-01-03  A   C   B
2001-01-04  B   A   
2001-01-05  B   A   C

Thanks in advance.


Solution

  • Here's a possible strategy with tidyverse libraries tidyr and dplyr

    Using the data

    dd<-read.table(text="Date        A.B  A.C  B.A   B.C  C.A  C.B
    2001-01-01  N/A  NA   0.14  0.35 0.43 NA
    2001-01-02  0.8  NA   NA    0.5  NA   0.32
    2001-01-03  0.75 0.8  NA    0.65 NA   0.9
    2001-01-04  NA   0.66 0.91  NA   NA   NA
    2001-01-05  0.52 0.62 0.48  0.82 0.4  NA", header=TRUE, na.strings=c("NA","N/A"))
    

    We can do the following

    library(tidyr)
    library(dplyr)
    dd %>% 
      pivot_longer(-Date) %>% 
      separate(name, c("first","second")) %>% 
      group_by(Date, first) %>% 
      filter(!is.na(value)) %>% 
      summarize(count=n(), max=max(value)) %>% 
      arrange(Date, desc(count), desc(max)) %>% 
      mutate(rank=row_number()) %>% 
      pivot_wider(Date, names_from=rank, values_from=first, values_fill=NA)
    

    Which returns

      Date       `1`   `2`   `3`  
      <chr>      <chr> <chr> <chr>
    1 2001-01-01 B     C     NA   
    2 2001-01-02 A     B     C    
    3 2001-01-03 A     C     B    
    4 2001-01-04 B     A     NA   
    5 2001-01-05 B     A     C   
    

    The idea is that you reshape your data into a more tidy format. Here we turn those columns into rows. Then split the names up so can we look at just the first letter. Then we count them and keep track of the max value for each better. We sort them, and then we reshape them to the wide format that matches your desired shape. Though you might want to rethink that because such a shape isn't always easy to work with in R. "Tidy" data is typically "rectangular" where each row has the same number of columns. Having ragged data like isn't pretty but it all depends on what you are going to do down stream.

    Note that in general you want to be careful about trying to directly translate code from excel to R. These two programs work very differently and sometime solving the same problem will involve very different strategy to be efficient in the different environments.