Search code examples
rfunctionnormalization

How to create a normalised rank column based on subset of factor variable?


Sample of data frame:

df <- structure(list(Rank = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16), Year = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("2001", "2003", "2005", "2007", "2009", "2011", "2013", "2015", "2017"), class = "factor")), .Names = c("Rank", "Year"), row.names = c(NA, -44L), class = c("tbl_df", "tbl", "data.frame"))

I have a "Rank" vector in my data frame, with ranges from 1 to x, depending on the factor value from "Year". I want to be able to normalise the ranks depending on the Year value, and put this in to a new column.

So far I'm using this custom function for normalising between 0 and 1:

range01 <- function(x){(x-min(x))/(max(x)-min(x))}

I created a new column filled with NAs.

df$normrank <- NA

Then I tried this to generate the normalised ranks for 2001, however this creates duplicate copies of the new normalised rank vector into each row of the subsetted factor value, rather then having matched normalised values in the rows with the corresponding old value.

df$normrank[which(df$Year==2001)] <- range01(subset(df, Year == 2001, select=Rank))

I'll at a stump at how to fix this, would appreciate any help - possibly with another custom function?


Solution

  • One solution using dplyr can be based on group_by on Year column and then applying normalizing factor to calculate normrank.

    df %>% group_by(Year) %>%
      mutate(normrank  = (Rank - min(Rank)) / (max(Rank)+min(Rank)) )
    
    # # Groups: Year [3]
    # Rank Year   normrank
    # <dbl> <fctr>    <dbl>
    #   1  1.00 2001     0     
    # 2  2.00 2001     0.0556
    # 3  3.00 2001     0.111 
    # 4  4.00 2001     0.167 
    # 5  5.00 2001     0.222 
    # 6  6.00 2001     0.278 
    # 7  7.00 2001     0.333 
    # 8  8.00 2001     0.389 
    # 9  9.00 2001     0.444 
    # 10 10.0  2001     0.500 
    # # ... with 34 more rows
    

    Now data is ready to filter on any Year.