sapply + if - retain column names

Even though this is related to "sapply - retain column names", I could not find the answer there.

I had a simple function to scale data between 0 and 1 that retained the column names:

scale <-  function(x){apply(x, 2, function(y) ((y)-min(y, na.rm=TRUE))/(max(y, na.rm=TRUE)-min(y, na.rm=TRUE)))}

Now I needed to add an if clause for the case where max(y) = min(y), and changed the function like so:

scale <- function(x){apply(x, 2, function(y) if(min(y, na.rm=TRUE)==max(y, na.rm=TRUE)) {0.5} else {((y)-min(y, na.rm=TRUE))/(max(y, na.rm=TRUE)-min(y, na.rm=TRUE))})}

Using these functions on an input data frame like so...

as.data.frame(scale(input[sapply(input,is.numeric)]))

produces different column names: the original function preserves the names, while the new one replaces brackets and hyphens with dots:

Example column name w/o the IF: INL_Avg(S-B0-ETC-CDS-06C~PM_CD1_D_B0_SI_P0V_B.NM)

Example column name w/ the IF: INL_Avg.S.B0.ETC.CDS.06C.PM_CD1_D_B0_SI_P0V_B.NM.

While I do realize these column names are not ideal, they are what I need to use, and I would appreciate a hint as to how to avoid this special character replacement (adding USE.NAMES=TRUE to the sapply won't help...).

Thanks, Mark

Solution

The root of your issue is that you are using apply on a data frame. apply is built to work on matrices, so the first thing it does is convert your data frame to a matrix, which is unnecessary. With the if clause, a constant column returns a single 0.5 while the other columns return full-length vectors, so apply can no longer assemble a matrix and returns a list instead, and the default as.data.frame() method for lists "fixes" the column names in a way you don't like. You may be able to fix this by adding check.names = FALSE to your as.data.frame() call, but a better approach would be a function that uses lapply on a data frame and apply on a matrix, and that also works if we give it a vector input.
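To make the name mangling concrete, here is a minimal sketch on made-up data. The data frame df_in and its column names are hypothetical, chosen only to include a non-syntactic name and one constant column. It suggests that the difference comes from whether apply returns a matrix or a list:

```r
# Hypothetical toy input: one non-syntactic column name, one constant column
df_in <- data.frame(`a-b(c)` = c(1, 2, 3), const = c(5, 5, 5),
                    check.names = FALSE)

# Every column yields a vector of the same length, so apply() returns a matrix
same_len <- apply(df_in, 2, function(y) y - min(y))
is.matrix(same_len)
# [1] TRUE
names(as.data.frame(same_len))
# [1] "a-b(c)" "const"

# The constant column yields a single 0.5, so the lengths differ
# and apply() returns a list instead of a matrix
mixed_len <- apply(df_in, 2, function(y) if (min(y) == max(y)) 0.5 else y - min(y))
is.list(mixed_len)
# [1] TRUE

# as.data.frame() on a list checks the names by default ...
names(as.data.frame(mixed_len))
# [1] "a.b.c." "const"

# ... unless told not to
names(as.data.frame(mixed_len, check.names = FALSE))
# [1] "a-b(c)" "const"
```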

I'd also strongly recommend not overwriting the built-in scale function with a similar-but-different function. That could easily cause bugs. I've rewritten your function calling it scale01() to make the distinction clear.
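As a small illustration of the masking problem (the matrix x below is made up for the example), the same call gives different results once a user-defined scale exists in the workspace:

```r
# Hypothetical input matrix
x <- matrix(c(1, 2, 3, 10, 20, 30), ncol = 2)

# Built-in behaviour: center each column and divide by its standard deviation
base_result <- scale(x)

# A min-max function stored under the same name masks the built-in one
scale <- function(x) {
  apply(x, 2, function(y) (y - min(y)) / (max(y) - min(y)))
}
masked_result <- scale(x)

# The same call now gives a different answer
isTRUE(all.equal(as.vector(base_result), as.vector(masked_result)))
# [1] FALSE

# Remove the masking copy so that later calls reach base::scale again
rm(scale)
```

Any code that is run in the same session and expects the built-in behaviour would silently get min-max scaled values instead.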

I also modified it so if the input is a constant vector with missing values, only the non-missing values will be filled in with 0.5, which seems safer.

I use S3 dispatch to work appropriately based on the input class, built on a default method that works on numeric vectors. Here it is, demonstrated on vector, data.frame, and matrix inputs:

## defining the functions
scale01 = function(x, ...) {
  UseMethod("scale01")
}

scale01.numeric = function(x, ...) {
  minx = min(x, na.rm = TRUE)
  maxx = max(x, na.rm = TRUE)
  if(minx == maxx) {
    x[!is.na(x)] = 0.5
    return(x)
  }
  (x - minx) / (maxx - minx)
}

scale01.data.frame = function(x, ...) {
  x[] = lapply(x, scale01)
  x
}

scale01.matrix = function(x, ...) {
  apply(x, MARGIN = 2, FUN = scale01)
}

## demonstrating usage

scale01(rnorm(5))
# [1] 0.0000000 1.0000000 0.4198958 0.6104154 0.2108150

scale01(mtcars[1:5, ])
#                         mpg cyl      disp        hp       drat        wt      qsec vs am gear      carb
# Mazda RX4         0.5609756 0.5 0.2063492 0.2073171 1.00000000 0.2678571 0.0000000  0  1    1 1.0000000
# Mazda RX4 Wag     0.5609756 0.5 0.2063492 0.2073171 1.00000000 0.4955357 0.1879195  0  1    1 1.0000000
# Datsun 710        1.0000000 0.0 0.0000000 0.0000000 0.93902439 0.0000000 0.7214765  1  1    1 0.0000000
# Hornet 4 Drive    0.6585366 0.5 0.5952381 0.2073171 0.00000000 0.7991071 1.0000000  1  0    0 0.0000000
# Hornet Sportabout 0.0000000 1.0 1.0000000 1.0000000 0.08536585 1.0000000 0.1879195  0  0    0 0.3333333

scale01(as.matrix(mtcars[1:5, ]))
#                         mpg cyl      disp        hp       drat        wt      qsec vs am gear      carb
# Mazda RX4         0.5609756 0.5 0.2063492 0.2073171 1.00000000 0.2678571 0.0000000  0  1    1 1.0000000
# Mazda RX4 Wag     0.5609756 0.5 0.2063492 0.2073171 1.00000000 0.4955357 0.1879195  0  1    1 1.0000000
# Datsun 710        1.0000000 0.0 0.0000000 0.0000000 0.93902439 0.0000000 0.7214765  1  1    1 0.0000000
# Hornet 4 Drive    0.6585366 0.5 0.5952381 0.2073171 0.00000000 0.7991071 1.0000000  1  0    0 0.0000000
# Hornet Sportabout 0.0000000 1.0 1.0000000 1.0000000 0.08536585 1.0000000 0.1879195  0  0    0 0.3333333

weird_name_df = data.frame(`weird column` = rnorm(5), `INL_Avg(S-B0-ETC-CDS-06C~PM_CD1_D_B0_SI_P0V_B.NM)` = rnorm(5), check.names = FALSE)
scale01(weird_name_df)
#   weird column INL_Avg(S-B0-ETC-CDS-06C~PM_CD1_D_B0_SI_P0V_B.NM)
# 1    0.6135744                                         0.2237905
# 2    0.0000000                                         0.4086837
# 3    1.0000000                                         1.0000000
# 4    0.7061441                                         0.2803262
# 5    0.7693184                                         0.0000000

If you want to transform all the numeric columns of a data frame, I would suggest:

## base version
numeric_cols = sapply(your_data, is.numeric)
your_data[numeric_cols] = scale01(your_data[numeric_cols])

## dplyr version
library(dplyr)
your_data %>%
  mutate(across(where(is.numeric), scale01))
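As a quick check of the base version above, here is a sketch on a made-up data frame. The contents of your_data are hypothetical, and the scale01 methods are repeated from above only so that the block runs on its own. It includes a character column, a constant column with a missing value, and a non-syntactic column name:

```r
# Copies of the methods defined above, repeated so this block is self-contained
scale01 <- function(x, ...) UseMethod("scale01")

scale01.numeric <- function(x, ...) {
  minx <- min(x, na.rm = TRUE)
  maxx <- max(x, na.rm = TRUE)
  if (minx == maxx) {
    # constant input: only the non-missing values become 0.5
    x[!is.na(x)] <- 0.5
    return(x)
  }
  (x - minx) / (maxx - minx)
}

scale01.data.frame <- function(x, ...) {
  x[] <- lapply(x, scale01)
  x
}

# Hypothetical input data
your_data <- data.frame(
  id = c("a", "b", "c"),
  const = c(7, NA, 7),
  `Avg(S-B0)` = c(10, 20, 40),
  check.names = FALSE
)

# Scale only the numeric columns; the character column is left untouched
numeric_cols <- sapply(your_data, is.numeric)
your_data[numeric_cols] <- scale01(your_data[numeric_cols])

names(your_data)
# [1] "id"        "const"     "Avg(S-B0)"
your_data$const
# [1] 0.5  NA 0.5
your_data[["Avg(S-B0)"]]
# [1] 0.0000000 0.3333333 1.0000000
```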