I have a dataframe representing the test set T, and another dataframe representing the training set D. The columns in the two data sets are exactly the same, as both were extracted from the same dataframe.
I use the following code to normalize the training set D:
MaxMinNormalize <- function(num) {
  if (is.factor(num)) num
  else ((num - min(num)) / (max(num) - min(num)))
}
D_n <- as.data.frame(lapply(D, MaxMinNormalize))
Some columns in the data are factors and others are numeric, which is why the normalize function checks for factors.
I want to apply this normalization step to the test set T, with the min and max values taken from the respective columns of the training set, not the test set. How should I go about doing that?
Thank you for any pointers!
Edit: As suggested by @coffeinjunky, I tried the following code to test whether it works with mixed-type columns (numeric and factor):
df <- mtcars[,c("mpg", "cyl", "am", "gear")]
df$am <- as.factor(df$am)
df$gear <- as.factor(df$gear)
df1 <- df[1:16,]
df2 <- df[17:32,]
summary(df1)
summary(df2)
new_df <- data.frame(sapply(names(df1), function(col) {
  ifelse(is.factor(df2[[col]]),
         df2[[col]],
         (df2[[col]]-min(df1[[col]]))/(max(df1[[col]])-min(df1[[col]])))
}))
head(new_df)
summary(new_df)
But the result is weird: somehow the function is stored in the data frame as well, and the columns' names were lost.
> head(new_df)
sapply.names.df1...function.col...
mpg 0.3071429
cyl 1.0000000
am 1.0000000
gear 1.0000000
> summary(new_df)
sapply.names.df1...function.col...
Min. :0.3071
1st Qu.:0.8268
Median :1.0000
Mean :0.8268
3rd Qu.:1.0000
Max. :1.0000
I suspect the ifelse used to deal with factor columns broke the structure of the data.
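The suspicion about ifelse can be checked in isolation. The sketch below uses a made-up vector x (not taken from the data above) to show that ifelse() returns a result shaped like its test argument. Since is.factor(x) is a single TRUE/FALSE, only one element comes back, whereas a plain if/else returns the whole column.

```r
# Hypothetical toy vector, only for illustration
x <- c(10, 20, 30)

# ifelse() is vectorised over `test`; a length-1 test gives a length-1 result
r_ifelse <- ifelse(is.factor(x), x, x / 10)

# if/else evaluates one branch and returns it whole
r_if <- if (is.factor(x)) x else x / 10

length(r_ifelse)  # 1
length(r_if)      # 3
```

This would explain why each column collapsed to a single value, after which sapply() simplified the four scalars into one named vector.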
Probably the easiest and most convenient way is to use existing functionality. Here, for instance, we could use functions provided in the caret package.
To illustrate, let us get some toy data:
# get some test data:
df <- mtcars[,c("mpg", "cyl")]
df1 <- df[1:16,] # training data
df2 <- df[17:32,] # test data to be scaled
Let's have a look to see what we would expect.
summary(df1) # some output omitted
mpg cyl
Min. :10.4 Min. :4.0
Max. :24.4 Max. :8.0
summary(df2)
mpg cyl
Min. :13.30 Min. :4.000
Max. :33.90 Max. :8.000
We see that the range (max - min) in df1 for mpg is 14, and for cyl it is 4. If we look at the max value for df2, it is 33.9 for mpg. Subtracting the min from df1, i.e. 10.4, and dividing by 14, should give us 23.5/14=1.6786. Similar math holds for the other columns and values.
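The same arithmetic can be checked directly in R. This is only a sketch of the calculation described above, using the df1/df2 split of mtcars (repeated here so the snippet runs on its own). The variable names train_min, train_range and expected_max are introduced just for this illustration.

```r
df <- mtcars[, c("mpg", "cyl")]
df1 <- df[1:16, ]   # training data
df2 <- df[17:32, ]  # test data

train_min <- min(df1$mpg)                 # 10.4
train_range <- max(df1$mpg) - train_min   # 14

# largest test value, scaled with the training min and range
expected_max <- (max(df2$mpg) - train_min) / train_range
round(expected_max, 4)  # 1.6786
```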
Now, let us use caret::preProcess and see if we get the same value.
library(caret)
train_stats <- preProcess(df1, method = "range")
new_df1 <- predict(train_stats, df1)
new_df2 <- predict(train_stats, df2)
Let's first check if new_df1 is scaled to the 0-1 range, as it should be.
summary(new_df1)
# some output omitted:
mpg cyl
Min. :0.0000 Min. :0.000
Max. :1.0000 Max. :1.000
Now let's see if we get the expected values on the test set:
summary(new_df2)
# some output omitted:
mpg cyl
Min. :0.2071 Min. :0.0000
Max. :1.6786 Max. :1.0000
Yes, it looks like this worked.
Now, just to show how to implement this by hand, consider that we need to go through each column, do an operation, and return the new column. This can often be achieved using a function of the apply family. Since two different dataframes with identical column names are involved, it seems a good idea to iterate over the column names. For instance,
sapply(names(df1), function(x) (...) )
will apply the function to each column name in df1, passing the name as its argument. Let's use it in the following way:
df2[] <- sapply(names(df1), function(col) {
  if (is.factor(df2[[col]])) {
    df2[[col]]
  } else {
    (df2[[col]] - min(df1[[col]])) / (max(df1[[col]]) - min(df1[[col]]))
  }
})
Let's see if this gives the expected result:
summary(df2)
mpg cyl
Min. :0.2071 Min. :0.0000
Max. :1.6786 Max. :1.0000
which it does.
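For the mixed numeric/factor case from the question, one possible variant is sketched below. It swaps sapply for lapply, on the assumption that keeping the result as a list avoids sapply's simplification to a matrix, which would turn factor columns into plain numbers. The helper name scale_with_train and the variables train_set and test_set are made up for this illustration.

```r
df <- mtcars[, c("mpg", "cyl", "am", "gear")]
df$am <- as.factor(df$am)
df$gear <- as.factor(df$gear)
train_set <- df[1:16, ]
test_set <- df[17:32, ]

# Hypothetical helper: scale numeric columns of `test` with the
# min and max of the matching columns in `train`; factors pass through
scale_with_train <- function(train, test) {
  test[] <- lapply(names(train), function(col) {
    if (is.factor(test[[col]])) {
      test[[col]]
    } else {
      lo <- min(train[[col]])
      hi <- max(train[[col]])
      (test[[col]] - lo) / (hi - lo)
    }
  })
  test
}

test_scaled <- scale_with_train(train_set, test_set)
summary(test_scaled)
```

A column that is constant in the training set would have a zero range and lead to a division by zero; that case is not handled in this sketch.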