
How do you remerge the response variable to the data frame after removing it for standardization?


I have a dataset with 61 columns (60 explanatory variables and 1 response variable).

All the explanatory variables are numerical, and the response (Default) is categorical. Some of the explanatory variables have negative values (financial data), so it seems more sensible to standardize than to normalize. However, when standardizing using the "apply" function, I have to remove the response variable first, so I do:

model <- read.table......

modelwithnoresponse <- model 
modelwithnoresponse$Default <- NULL
means <- apply(modelwithnoresponse,2,mean)
standarddeviations <- apply(modelwithnoresponse,2,sd)
modelSTAN <- scale(modelwithnoresponse,center=means,scale=standarddeviations)

So far so good, the data is standardized. However, now I would like to add the response variable back to "modelSTAN". I've seen some posts on dplyr, merge functions and rbind, but I couldn't quite get them to work so that the response would simply be added back as the last column of "modelSTAN".

Does anyone have a good solution to this, or maybe another workaround to standardize it without removing the response variable first?

I'm quite new to R, as I'm a finance student and took R as an elective.


Solution

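  • If you want to add the column model$Default to modelSTAN, note that scale() returns a matrix, so convert it to a data frame first. Then you can do it like this:

    modelSTAN <- as.data.frame(modelSTAN)
    # assign the column directly
    modelSTAN$Default <- model$Default
    # or, instead, use cbind for columns (rbind is for rows), naming the new column
    modelSTAN <- cbind(modelSTAN, Default = model$Default)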
    

    However, you don't need to remove it at all. Here's an alternative:

    modelSTAN <- model 
    ## get index of response, here named Default
    resp <- which(names(modelSTAN) == "Default")
    ## standardize all the non-response columns
    means <- colMeans(modelSTAN[-resp])
    sds <- apply(modelSTAN[-resp], 2, sd)
    modelSTAN[-resp] <- scale(modelSTAN[-resp], center = means, scale = sds)
    
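    As a quick sanity check on the in-place approach, here is a minimal sketch on a small made-up data frame (the column names x1 and x2 and all values are assumptions, not the asker's data). After standardizing, each explanatory column should have a mean close to 0 and a standard deviation of 1, and the response should be unchanged:

    ```r
    # Toy data standing in for the real `model` (values are made up)
    model <- data.frame(
      x1 = c(-2.5, 0.3, 4.1, 1.2),
      x2 = c(100, -50, 25, 75),
      Default = factor(c("yes", "no", "no", "yes"))
    )

    modelSTAN <- model
    resp <- which(names(modelSTAN) == "Default")

    # Standardize every column except the response, in place
    modelSTAN[-resp] <- scale(modelSTAN[-resp])

    # Column means should be (numerically) zero, standard deviations one
    round(colMeans(modelSTAN[-resp]), 10)
    sapply(modelSTAN[-resp], sd)

    # The response column is untouched
    identical(modelSTAN$Default, model$Default)
    ```

    Because only the selected columns are overwritten, the response keeps its original position and type, and the result stays a data frame.
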

    If you're interested in dplyr:

    library(dplyr)
    modelSTAN <- model %>%
      # scale() returns a one-column matrix, so convert back to a plain numeric vector
      mutate(across(-all_of("Default"), ~ as.numeric(scale(.x))))
    

    Note that in the dplyr version I didn't bother saving the original means and SDs; you should still do that if you want to back-transform later. By default, scale() centers each column on its mean and divides it by its standard deviation.
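
    To illustrate what back-transforming could look like, here is a minimal sketch on made-up data (the data frame x and its values are assumptions). It relies on the fact that scale() keeps the centers and scales it used as the attributes "scaled:center" and "scaled:scale" on the returned matrix; means and SDs saved by hand would work the same way:

    ```r
    # Toy data (made up) standing in for the explanatory columns
    x <- data.frame(x1 = c(-2.5, 0.3, 4.1, 1.2), x2 = c(100, -50, 25, 75))

    # scale() returns a matrix and records what it subtracted and divided by
    scaled <- scale(x)
    means <- attr(scaled, "scaled:center")
    sds   <- attr(scaled, "scaled:scale")

    # Undo the standardization column by column: multiply by the SD, then add the mean
    restored <- sweep(sweep(scaled, 2, sds, "*"), 2, means, "+")

    # Largest absolute difference from the original data; should be essentially zero
    max(abs(restored - as.matrix(x)))
    ```

    The same means and sds can also be passed to scale() as center and scale to standardize new data on the original scale.
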