R: Applying normalization function column wise - large DataFrame/DataTable


I have a large R data.frame with close to 500 columns. I want to apply the existing scale function, and also try out different normalization functions, column-wise.

Using the existing scale function:

library(dplyr)

set.seed(1234)
dat <- data.frame(x = rnorm(10, 30, .2), 
                  y = runif(10, 3, 5),
                  z = runif(10, 10, 20), k = runif(10, 5, 10))

dat %>% mutate_each_(funs(scale),vars=c("y","z")) 

Question 1: Here vars has only two columns, but when there are 500 columns to normalize, what is the best way? I tried the following:

dnot <- c("y", "z")
dat %>% mutate_each_(funs(scale),vars=!(names(dat) %in% dnot)) 

Error:

Error in UseMethod("as.lazy_dots") : 
  no applicable method for 'as.lazy_dots' applied to an object of class "logical"

Question 2: Instead of using the built-in scale function, I want to apply my own function to normalize the data frame.

Example: I have the following function:

# Divide each value by the column total so the column sums to 1
normalized_columns <- function(x)
{
  x / sum(x)
}

How can I efficiently apply this to all the columns while leaving out only 3 or 4 of them?


Solution

  • There are better approaches, but I usually do something like:

    set.seed(1234)
    x = rnorm(10, 30, .2)
    y = runif(10, 3, 5)
    z = runif(10, 10, 20)
    k = runif(10, 5, 10)
    a = rnorm(10, 30, .2)
    b = runif(10, 3, 5)
    c = runif(10, 10, 20)
    d = runif(10, 5, 10)
    
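    # Divide each value by its column total
    normalized_columns <- function(x)
    {
      x / sum(x)
    }
    
    dat <- data.frame(x, y, z, k, a, b, c, d)
    # Overwrite only the chosen columns (x, k, b, c, d); y, z and a are left as-is
    dat[, c(1, 4, 6:8)] <- sapply(dat[, c(1, 4, 6:8)], normalized_columns)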
    

    Edit: as far as efficiency goes, this is pretty fast:

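    set.seed(100)
    # 100 rows x 500 columns of random data
    dat <- data.frame(matrix(rnorm(50000, 5, 2), nrow = 100, ncol = 500))
    # Pick 495 of the 500 columns at random, leaving 5 untouched
    cols <- sample.int(500, 495, replace = FALSE)
    system.time(dat[, cols] <- sapply(dat[, cols], normalized_columns))
    ##   user  system elapsed
    ##   0.03    0.00    0.03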
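
If you would rather exclude columns by name than by position (as in Question 1), something like the following should work in base R. This is only a sketch using the question's example data: `dnot` is the question's vector of columns to skip, while `cols`, `custom` and `scaled` are names made up for illustration. The error in the question most likely comes from passing a logical vector where column names were expected; `setdiff()` gives a character vector of names instead.

```r
set.seed(1234)
dat <- data.frame(x = rnorm(10, 30, .2),
                  y = runif(10, 3, 5),
                  z = runif(10, 10, 20),
                  k = runif(10, 5, 10))

# Divide each value by its column total
normalized_columns <- function(x) x / sum(x)

dnot <- c("y", "z")                # columns to leave untouched
cols <- setdiff(names(dat), dnot)  # names of the columns to transform

# Question 2: apply the custom function to every column not in dnot
custom <- dat
custom[cols] <- lapply(custom[cols], normalized_columns)

# Question 1: same idea with scale(); scale() returns a one-column matrix,
# so as.numeric() turns it back into a plain numeric vector
scaled <- dat
scaled[cols] <- lapply(scaled[cols], function(col) as.numeric(scale(col)))
```

Using `lapply()` with `dat[cols] <- ...` keeps each column as a separate vector, so the same pattern should carry over to 500 columns with only `dnot` changing.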