
R: How to normalize one dataframe (test set) given the values of a different dataframe (training set)


I have a dataframe representing the test set T, and another dataframe representing the training set D. The columns in these two data sets are exactly the same as they were extracted from the same dataframe.

I use the following code to normalize the training set D:

MaxMinNormalize <- function(num) {
  if (is.factor(num)) num
  else ((num - min(num)) / (max(num) - min(num)))
}

D_n <- as.data.frame(lapply(D, MaxMinNormalize))

Some columns in the data are factors and others are numeric, which is why the function checks is.factor and returns factor columns unchanged.

I want to apply the same normalization to the test set T, using the min and max values from the corresponding columns of the training set, not the test set. How should I go about doing that?

Thank you for any pointer!


Edit: As suggested by @coffeinjunky, I tried the following code to test how it handles mixed-type columns (numeric and factor):

df <- mtcars[,c("mpg", "cyl", "am", "gear")]

df$am <- as.factor(df$am)

df$gear <- as.factor(df$gear)

df1 <- df[1:16,]
df2 <- df[17:32,]

summary(df1)
summary(df2)

new_df <- data.frame(sapply(names(df1), function(col) {
  ifelse(is.factor(df2[[col]]), 
         df2[[col]],
         (df2[[col]]-min(df1[[col]]))/(max(df1[[col]])-min(df1[[col]]))) 

}))

head(new_df)
summary(new_df)

But the result is weird: the function call somehow ends up as the column name of the data frame, and the original column names are lost.

> head(new_df)
     sapply.names.df1...function.col...
mpg                           0.3071429
cyl                           1.0000000
am                            1.0000000
gear                          1.0000000
> summary(new_df)
 sapply.names.df1...function.col...
 Min.   :0.3071                    
 1st Qu.:0.8268                    
 Median :1.0000                    
 Mean   :0.8268                    
 3rd Qu.:1.0000                    
 Max.   :1.0000    

I suspect that the ifelse used to handle the factor columns broke the structure of the data.


Solution

  • Probably the easiest way is to use existing functionality. Here, for instance, we can use the functions provided by the caret package.

    To illustrate, let us get some toy data:

    # get some test data:
    df <- mtcars[,c("mpg", "cyl")]
    df1 <- df[1:16,]  # training data
    df2 <- df[17:32,] # test data to be scaled
    

    Let's have a look to see what we would expect.

    summary(df1) # some output omitted
          mpg            cyl     
     Min.   :10.4   Min.   :4.0  
     Max.   :24.4   Max.   :8.0  
    
    summary(df2)
          mpg             cyl       
     Min.   :13.30   Min.   :4.000  
     Max.   :33.90   Max.   :8.000  
    

    We see that the range (max - min) of mpg in df1 is 14 (24.4 - 10.4), and for cyl it is 4. The max value of mpg in df2 is 33.9. Subtracting the df1 min of 10.4 and dividing by 14 should give us 23.5/14 ≈ 1.6786. The same arithmetic applies to the other columns and values.
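
    The expected values can also be computed directly in R rather than by hand. This is only a small sketch of the arithmetic described above; the variable names (train_min, train_range, and so on) are made up for illustration.

    ```r
    # Same toy data as above
    df  <- mtcars[, c("mpg", "cyl")]
    df1 <- df[1:16, ]   # training data
    df2 <- df[17:32, ]  # test data

    # Min and range come from the training set only
    train_min   <- min(df1$mpg)
    train_range <- max(df1$mpg) - train_min

    # Expected scaled min and max of mpg in the test set
    expected_min <- (min(df2$mpg) - train_min) / train_range
    expected_max <- (max(df2$mpg) - train_min) / train_range

    round(c(expected_min, expected_max), 4)
    # [1] 0.2071 1.6786
    ```

    Values above 1 (or below 0) are expected here, because the test set can contain observations outside the range seen in the training set.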

    Now, let us use caret::preProcess and see if we get the same value.

    library(caret)
    train_stats <- preProcess(df1, method = "range")
    new_df1 <- predict(train_stats, df1)
    new_df2 <- predict(train_stats, df2)
    

    Let's first check if new_df1 is scaled to the 0-1 range, as it should be.

    summary(new_df1)
    # some output omitted:
          mpg              cyl       
     Min.   :0.0000   Min.   :0.000  
     Max.   :1.0000   Max.   :1.000  
    

    Now let's see if we get the expected values on the test set:

    summary(new_df2)
    # some output omitted:
          mpg              cyl        
     Min.   :0.2071   Min.   :0.0000  
     Max.   :1.6786   Max.   :1.0000  
    

    Yes, it looks like this worked.
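
    As a cross-check that does not depend on caret, base R's scale() accepts per-column center and scale vectors, so the training min and range can be passed in directly. This is a sketch for numeric-only data frames such as the toy data here; train_min, train_range and scaled_df2 are illustrative names.

    ```r
    # Same toy data as above
    df  <- mtcars[, c("mpg", "cyl")]
    df1 <- df[1:16, ]   # training data
    df2 <- df[17:32, ]  # test data

    # Per-column min and range, computed on the training set only
    train_min   <- sapply(df1, min)
    train_range <- sapply(df1, max) - train_min

    # scale() subtracts `center` and divides by `scale`, column by column
    scaled_df2 <- as.data.frame(scale(df2, center = train_min, scale = train_range))

    summary(scaled_df2)
    ```

    Note that scale() converts its input to a matrix, so this shortcut is not suitable when the data frame also contains factor columns.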

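    Now, just to show how to implement this by hand: we need to go through each column, apply an operation, and return the new column. This is typically done with a function from the apply family. Since the two dataframes have identical column names, it makes sense to iterate over the column names. For instance,

    lapply(names(df1), function(x) (...) )
    

    calls the function once for each column name in df1 and returns a list with one element per column. lapply is used here rather than sapply because sapply simplifies its result to a matrix, and a matrix cannot hold factor columns, so they would not survive as factors. Assigning into df2[] keeps the data frame structure and the column names. Let's use it in the following way:

    # min and max come from the training set df1; factor columns pass through unchanged
    df2[] <- lapply(names(df1), function(col) {
        if (is.factor(df2[[col]])) return(df2[[col]])
        (df2[[col]] - min(df1[[col]])) / (max(df1[[col]]) - min(df1[[col]]))})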
    

    Let's see if this gives the expected result:

    summary(df2)
          mpg              cyl        
     Min.   :0.2071   Min.   :0.0000  
     Max.   :1.6786   Max.   :1.0000  
    

    which it does.
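
    For the mixed-type data from the edited question, one possible way to organize this is to separate "learning" the min and max on the training set from "applying" them to any other data frame. The sketch below assumes this split; fit_minmax and apply_minmax are hypothetical helper names, not functions from any package. It also shows why ifelse collapsed the columns in the question: ifelse returns a result with the length of its test, and is.factor returns a single TRUE or FALSE, so only one value per column comes back. A plain if/else, or skipping non-numeric columns as done here, avoids that.

    ```r
    # Hypothetical helpers (illustrative names, not from any package).

    # Learn per-column min and max from the numeric columns of the training data.
    fit_minmax <- function(train) {
      num_cols <- names(train)[sapply(train, is.numeric)]
      list(cols = num_cols,
           min  = sapply(train[num_cols], min),
           max  = sapply(train[num_cols], max))
    }

    # Rescale the same columns of another data frame with the stored values.
    # Columns not listed in stats$cols (e.g. factors) are left untouched.
    apply_minmax <- function(data, stats) {
      for (col in stats$cols) {
        data[[col]] <- (data[[col]] - stats$min[[col]]) /
          (stats$max[[col]] - stats$min[[col]])
      }
      data
    }

    # Mixed-type data as in the question's edit
    df <- mtcars[, c("mpg", "cyl", "am", "gear")]
    df$am   <- as.factor(df$am)
    df$gear <- as.factor(df$gear)
    train <- df[1:16, ]
    test  <- df[17:32, ]

    stats       <- fit_minmax(train)
    test_scaled <- apply_minmax(test, stats)

    str(test_scaled)        # mpg and cyl are numeric, am and gear are still factors
    range(test_scaled$mpg)

    # ifelse() with a length-one test returns a length-one result
    length(ifelse(FALSE, 1:3, 4:6))
    # [1] 1
    ```

    This sketch does not guard against constant columns in the training set, where max equals min and the division would produce NaN or Inf.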