Search code examples
rdplyrnormalization

Normalize By Group for All Columns


I have a dataset with 400+ columns. The first two columns I want to exclude in this calculation but I want to include it in my final output.

For columns, 3:ncol(df), how do I normalize by group?

This is what I have right now but it gave me an error and it took very long to run at the same time:

library(BBmisc)
test<-df %>% 
group_by(group) %>% 
mutate_all(.vars = df[3:ncol(df)], 
           funs(normalize))

I want to be able to set the range from 0 to 5.

This is what my dataset looks like:

df
group    week    col3     col4    col5 ......
    A       1      25       56      87 ......
    A       2      21       34      98 ......
    A       3      34       67     100 ......
    B       1      11      120    1000 ......
    B       2       8      340    1200 ......
    B       3       2      560    2000 ......

I want to apply a normalization function from column 3 and on (not hardcoded to col3 to col5 because I have 400 columns in total) by group


Solution

  • We need mutate_at

    df %>% 
      group_by(group) %>% 
      mutate_at(vars(-one_of("week")), normalize)
    # A tibble: 6 x 5
    # Groups:   group [2]
    #  group  week   col3   col4   col5
    #  <chr> <int>  <dbl>  <dbl>  <dbl>
    #1 A         1 -0.250  0.218 -1.14 
    #2 A         2 -0.851 -1.09   0.429
    #3 A         3  1.10   0.873  0.714
    #4 B         1  0.873 -1     -0.756
    #5 B         2  0.218  0     -0.378
    #6 B         3 -1.09   1      1.13 
    

    As mentioned in the comments,if we are using a range of index, then should be careful with mutate_at. Currently, the indexing starts without the group_by column. So, if we wanted to pass a range of index, subtract one from the start and stop positions

    df %>%
        group_by(group) %>%
        mutate_at(vars((3-1):(ncol(.)-1)), normalize)
    # A tibble: 6 x 5
    # Groups:   group [2]
    #  group  week   col3   col4   col5
    #  <chr> <int>  <dbl>  <dbl>  <dbl>
    #1 A         1 -0.250  0.218 -1.14 
    #2 A         2 -0.851 -1.09   0.429
    #3 A         3  1.10   0.873  0.714
    #4 B         1  0.873 -1     -0.756
    #5 B         2  0.218  0     -0.378
    #6 B         3 -1.09   1      1.13 
    

    data

    df <- structure(list(group = c("A", "A", "A", "B", "B", "B"), week = c(1L, 
    2L, 3L, 1L, 2L, 3L), col3 = c(25L, 21L, 34L, 11L, 8L, 2L), col4 = c(56L, 
    34L, 67L, 120L, 340L, 560L), col5 = c(87L, 98L, 100L, 1000L, 
    1200L, 2000L)), class = "data.frame", row.names = c(NA, -6L))