sapply + if - retain column names


Even though this is related to "sapply - retain column names", I could not find the answer there.

I had a simple function to scale data between 0 and 1 that retained the column names:

scale <-  function(x){apply(x, 2, function(y) ((y)-min(y, na.rm=TRUE))/(max(y, na.rm=TRUE)-min(y, na.rm=TRUE)))}

Now I needed to add an if clause for the case where max(y) = min(y), so I changed the function like so:

scale <- function(x){apply(x, 2, function(y) if(min(y, na.rm=TRUE)==max(y, na.rm=TRUE)) {0.5} else {((y)-min(y, na.rm=TRUE))/(max(y, na.rm=TRUE)-min(y, na.rm=TRUE))})}

Using these functions on an input data frame like so...

as.data.frame(scale(input[sapply(input,is.numeric)]))

produces different column names: the original function preserved the names, but the new one replaces brackets and hyphens with dots:

Example column name w/o the IF: INL_Avg(S-B0-ETC-CDS-06C~PM_CD1_D_B0_SI_P0V_B.NM)

Example column name w/ the IF: INL_Avg.S.B0.ETC.CDS.06C.PM_CD1_D_B0_SI_P0V_B.NM.

While I realize these column names are not ideal, they are what I need to use, and I would appreciate a hint as to how to avoid this special character replacement (adding USE.NAMES=TRUE to the sapply won't help...).

Thanks, Mark


Solution

  • The root of your issue is that you are using apply on a data frame. apply is built to work on matrices, so the first thing it does is convert your data frame to a matrix, which is unnecessary. With the if clause, a constant column returns a single value (0.5) while the other columns return full-length vectors, so apply can no longer assemble a matrix and returns a list instead. Converting that list back with as.data.frame() then "fixes" the column names in a way you don't like. You may be able to fix this by adding check.names = FALSE to your as.data.frame() call, but a better approach is to use lapply on a data frame and apply on a matrix, and to have the function work on a plain vector as well.
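
    To make the name change concrete, here is a minimal sketch on a made-up two-column data frame (df and scale_if are hypothetical names, not from your code). It assumes a recent R version in which as.data.frame() on a list accepts check.names:

```r
# Hypothetical data frame with awkward names; the second column is constant.
df <- data.frame(`a-b` = c(1, 2, 3), `c(d)` = c(5, 5, 5), check.names = FALSE)

# Same shape as the function in the question: a single value for a constant
# column, a full-length vector otherwise.
scale_if <- function(y) {
  rng <- range(y, na.rm = TRUE)
  if (rng[1] == rng[2]) 0.5 else (y - rng[1]) / (rng[2] - rng[1])
}

res <- apply(df, 2, scale_if)
is.list(res)  # the results differ in length, so apply returns a list
names(res)    # the original names are still intact here

mangled <- as.data.frame(res)                       # default: names get "repaired"
kept    <- as.data.frame(res, check.names = FALSE)  # names are left alone
names(mangled)
names(kept)
```

    Without a constant column, every result has the same length, apply returns a matrix, and the names survive the conversion, which is probably why the original function did not show the problem.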

    I'd also strongly recommend not overwriting the built-in scale function with a similar-but-different function. That could easily cause bugs. I've rewritten your function and named it scale01() to make the distinction clear.

    I also modified it so if the input is a constant vector with missing values, only the non-missing values will be filled in with 0.5, which seems safer.

    I use S3 dispatch to work appropriately based on the input class, built on a default method that works on numeric vectors. Here it is, demonstrated on vector, data.frame, and matrix inputs:

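    ## defining the functions
    ## generic: dispatches on the class of x
    scale01 = function(x, ...) {
      UseMethod("scale01")
    }
    
    ## numeric vectors: rescale to [0, 1]; NAs stay NA
    scale01.numeric = function(x, ...) {
      minx = min(x, na.rm = TRUE)
      maxx = max(x, na.rm = TRUE)
      ## constant input: avoid dividing by zero, set non-missing values to 0.5
      if(minx == maxx) {
        x[!is.na(x)] = 0.5
        return(x)
      }
      (x - minx) / (maxx - minx)
    }
    
    ## data frames: scale column by column; x[] keeps the structure and names
    scale01.data.frame = function(x, ...) {
      x[] = lapply(x, scale01)
      x
    }
    
    ## matrices: scale each column
    scale01.matrix = function(x, ...) {
      apply(x, MARGIN = 2, FUN = scale01)
    }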
    
    ## demonstrating usage
    
    scale01(rnorm(5))
    # [1] 0.0000000 1.0000000 0.4198958 0.6104154 0.2108150
    
    scale01(mtcars[1:5, ])
    #                         mpg cyl      disp        hp       drat        wt      qsec vs am gear      carb
    # Mazda RX4         0.5609756 0.5 0.2063492 0.2073171 1.00000000 0.2678571 0.0000000  0  1    1 1.0000000
    # Mazda RX4 Wag     0.5609756 0.5 0.2063492 0.2073171 1.00000000 0.4955357 0.1879195  0  1    1 1.0000000
    # Datsun 710        1.0000000 0.0 0.0000000 0.0000000 0.93902439 0.0000000 0.7214765  1  1    1 0.0000000
    # Hornet 4 Drive    0.6585366 0.5 0.5952381 0.2073171 0.00000000 0.7991071 1.0000000  1  0    0 0.0000000
    # Hornet Sportabout 0.0000000 1.0 1.0000000 1.0000000 0.08536585 1.0000000 0.1879195  0  0    0 0.3333333
    
    scale01(as.matrix(mtcars[1:5, ]))
    #                         mpg cyl      disp        hp       drat        wt      qsec vs am gear      carb
    # Mazda RX4         0.5609756 0.5 0.2063492 0.2073171 1.00000000 0.2678571 0.0000000  0  1    1 1.0000000
    # Mazda RX4 Wag     0.5609756 0.5 0.2063492 0.2073171 1.00000000 0.4955357 0.1879195  0  1    1 1.0000000
    # Datsun 710        1.0000000 0.0 0.0000000 0.0000000 0.93902439 0.0000000 0.7214765  1  1    1 0.0000000
    # Hornet 4 Drive    0.6585366 0.5 0.5952381 0.2073171 0.00000000 0.7991071 1.0000000  1  0    0 0.0000000
    # Hornet Sportabout 0.0000000 1.0 1.0000000 1.0000000 0.08536585 1.0000000 0.1879195  0  0    0 0.3333333
    
    weird_name_df = data.frame(`weird column` = rnorm(5), `INL_Avg(S-B0-ETC-CDS-06C~PM_CD1_D_B0_SI_P0V_B.NM)` = rnorm(5), check.names = FALSE)
    scale01(weird_name_df)
    #   weird column INL_Avg(S-B0-ETC-CDS-06C~PM_CD1_D_B0_SI_P0V_B.NM)
    # 1    0.6135744                                         0.2237905
    # 2    0.0000000                                         0.4086837
    # 3    1.0000000                                         1.0000000
    # 4    0.7061441                                         0.2803262
    # 5    0.7693184                                         0.0000000
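
    The handling of constant input with missing values, mentioned above, can be checked with a small sketch. scale01_vec is a hypothetical stand-alone copy of the numeric method, included only so the snippet runs by itself:

```r
# Stand-alone copy of the numeric method (hypothetical name scale01_vec).
scale01_vec <- function(x) {
  minx <- min(x, na.rm = TRUE)
  maxx <- max(x, na.rm = TRUE)
  if (minx == maxx) {
    x[!is.na(x)] <- 0.5  # only the non-missing values are filled in
    return(x)
  }
  (x - minx) / (maxx - minx)
}

# Constant vector with a missing value: the NA stays NA.
scale01_vec(c(3, NA, 3))
# [1] 0.5  NA 0.5

# Non-constant vector: the usual min-max scaling, NA preserved.
scale01_vec(c(2, NA, 4, 6))
# [1] 0.0  NA 0.5 1.0
```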
    

    If you want to transform all the numeric columns of a data frame, I would suggest:

    ## base version
    numeric_cols = sapply(your_data, is.numeric)
    your_data[numeric_cols] = scale01(your_data[numeric_cols])
    
    ## dplyr version
    library(dplyr)
    your_data %>%
      mutate(across(where(is.numeric), scale01))
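
    As a rough end-to-end check of the base approach, here is a sketch on a made-up mixed data frame. The data and the helper scale01_vec (a stand-alone copy of the numeric method) are hypothetical, and vapply is swapped in for sapply only because it always returns a logical vector:

```r
# Stand-alone copy of the numeric method (hypothetical name scale01_vec).
scale01_vec <- function(x) {
  minx <- min(x, na.rm = TRUE)
  maxx <- max(x, na.rm = TRUE)
  if (minx == maxx) {
    x[!is.na(x)] <- 0.5
    return(x)
  }
  (x - minx) / (maxx - minx)
}

# Made-up data: one character column, two numeric columns with awkward names.
your_data <- data.frame(
  id     = c("a", "b", "c"),
  `x(1)` = c(10, 20, 30),
  `y-2`  = c(7, 7, 7),
  check.names = FALSE
)

# Select the numeric columns and scale them in place with lapply,
# so no conversion to a matrix or list-to-data-frame step is involved.
numeric_cols <- vapply(your_data, is.numeric, logical(1))
your_data[numeric_cols] <- lapply(your_data[numeric_cols], scale01_vec)

names(your_data)  # the awkward names are unchanged
your_data
```

    The character column is left untouched and the constant column becomes 0.5 throughout.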