r dataframe resampling factors approximation

Keep factor columns as factors when using approx() on a data frame in R


I have a big data frame with a lot of columns. Some of them are of type double and others are of type factor. I resample the data frame onto a new "time" column by calling approx() with method = "constant". After that, all factor columns have been converted to doubles.

For example, my current approach looks like this:

time = seq(1, 6, by = 0.1)

df1 <- data.frame(ecuTime = c(2, 4, 6), a = as.factor(c("male", "female", 
                                                   "male")), b = c(1, 3, 5))

df2 <- data.frame(ecuTime = c(1, 3.2, 3.4, 6), c = as.factor(c("car", "car", 
                                                    "bike", "car")), d = c(2, 3, 5, 6))

dfComb <- merge(df1, df2, by = "ecuTime", all = TRUE)

approxData <- cbind.data.frame(time, sapply(dfComb[, names(dfComb)], 
                                        function(y, x, nout) 
                                        approx(x, y, nout, method = "constant", na.rm = FALSE)$y,
                                        x = dfComb$ecuTime, nout = time))

Is it possible to keep the factor columns as factors and the columns of type double as doubles even if I use the function approx?

Edit: I found out that it doesn't make sense to use the approx function on factors. I also don't want to use na.rm = TRUE, because I have a lot of NAs in some columns, and replacing them with previous values would make the result differ greatly from the original data (distributions etc.). Is there an alternative solution that applies approx only to the non-factor columns and then merges the result with the original factor columns? I think it makes sense not to fill the factor columns with prior values, and instead to keep only the original values attached to the resampled time grid (steps of 0.1, 0.2, etc.). After that, everything could be merged.

I am confused about how to combine df1 and df2 at a resampled time frequency without my distributions and line plots ending up completely different from the original data. My final goal is to compare some specific factors within a specific time frame. At the moment I can't compare different variables, because one of them might be NA where the other is not.


Solution

  • So, I'm not clear on the big picture of what you're trying to get done here, which is fine; I understand the specific question well enough. However, I'm trusting that you're really, really sure this is a good idea -- at face value, I'd be pretty worried about doing something resembling arithmetic via the approx() function on the underlying integers of a factor variable (which are totally meaningless). It seems to me like there is probably a "better" (i.e. less hacky) way to get this done, but I'm not in a position to help you do that since your overall goals aren't clear to me.
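
    To make that concern concrete, here is a small sketch (the variable names `f` and `mid` are only for illustration). It shows that approx() only ever sees the integer codes of a factor, and that anything other than step interpolation can produce values that correspond to no level at all:

    ```r
    f <- factor(c("male", "female", "male"))

    # Internal codes follow the sorted levels: female = 1, male = 2
    as.integer(f)

    # Linear interpolation halfway between "male" (2) and "female" (1)
    mid <- approx(c(2, 4), as.integer(f)[1:2], xout = 3)$y
    mid   # 1.5, which does not map to any level
    ```

    With method = "constant" the results stay on valid codes, which is why the approach below can work at all.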

    That said, here's one possible road map to do what you want using base R:

    • identify which variables should be factors
    • inside approxData, convert those variables back into factor type
    • remap the levels of the new factor variables based on the corresponding values from df

    Code, expanded with an extra factor column (to verify that it runs properly in the case with more than one factor variable):

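    time <- 1:6
    df <- data.frame(ecuTime = c(2, 4, 6),
                     a = as.factor(c("male", "female", "male")),
                     b = c(1, 3, 5),
                     # as.factor() is required: since R 4.0.0, data.frame() no longer
                     # converts character vectors to factors by default
                     c = as.factor(c("blue", "blue", "yellow")))
    str(df)
    
    # approx() only sees the integer codes of each factor, so every column comes back as double
    approxData <- cbind.data.frame(time, sapply(df[, names(df)], 
                                                function(y, x, nout) 
                                                  approx(x, y, nout, method = "constant")$y,
                                                x = df$ecuTime, nout = time))
    str(approxData)
    
    # names(df)[...] also works when there is only a single factor column
    factor_vars <- names(df)[sapply(df, is.factor)]
    approxData[, factor_vars] <- 
      lapply(factor_vars, function(x) {
        # map the integer codes back to their labels explicitly, so a level that
        # never appears in the resampled data cannot shift the mapping
        factor(approxData[[x]], levels = seq_along(levels(df[[x]])), labels = levels(df[[x]]))
      })
    
    str(approxData)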
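
    As a possible alternative that never passes a factor through approx(), the same "last observation at or before each time" lookup can be sketched with findInterval(), which returns row indices rather than interpolated values. This is a sketch, not part of the original approach; it assumes df$ecuTime is sorted in increasing order, and `idx` and `stepData` are names chosen here for illustration:

    ```r
    time <- 1:6
    df <- data.frame(ecuTime = c(2, 4, 6),
                     a = as.factor(c("male", "female", "male")),
                     b = c(1, 3, 5))

    # Index of the last observation at or before each requested time (0 if there is none)
    idx <- findInterval(time, df$ecuTime)
    idx[idx == 0] <- NA   # times before the first observation become NA

    # Indexing keeps each column's type, so `a` stays a factor
    stepData <- data.frame(time = time, a = df$a[idx], b = df$b[idx])
    str(stepData)
    ```

    Because the columns are indexed rather than interpolated, no conversion back to factor is needed afterwards.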
    

    For the edited question: here's some code to produce a new data frame, dfComb_resample. This data frame has an expanded ecuTime variable, values for a, b, c, d copied from df1 and df2 where appropriate, and NA values everywhere else. (If I missed the mark on what you wanted, let me know.)

    time = seq(1, 6, by = 0.1)
    
    df1 <- data.frame(ecuTime = c(2, 4, 6), a = as.factor(c("male", "female", 
                                                            "male")), b = c(1, 3, 5))
    
    df2 <- data.frame(ecuTime = c(1, 3.2, 3.4, 6), c = as.factor(c("car", "car", 
                                                                   "bike", "car")), d = c(2, 3, 5, 6))
    
    dfComb_resample <- 
      Reduce(function(x, y) merge(x=x, y=y, by = "ecuTime", all = TRUE),
             list(data.frame(ecuTime = time), df1, df2))
    

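    How it works: Reduce() applies merge() pairwise from left to right, so it is a shortcut for merging three (or more) data frames in one call. Note that if any of the merged data frames shared a column name other than ecuTime, merge() would keep both copies and rename them with .x and .y suffixes. That doesn't happen in this example because the data frames have no other columns in common.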
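
    If, as the edited question suggests, the numeric columns should still be step-interpolated while the factor columns keep only their original observations, one way to sketch it is shown below. This is an illustration under assumptions, not a definitive implementation: `resample_one` and `key10` are hypothetical names, and the integer key assumes the time grid and all ecuTime values are multiples of 0.1. The integer key is used because decimal values produced by seq() may not compare exactly equal to typed-in values such as 3.2 when merging.

    ```r
    time <- seq(1, 6, by = 0.1)

    df1 <- data.frame(ecuTime = c(2, 4, 6),
                      a = as.factor(c("male", "female", "male")),
                      b = c(1, 3, 5))
    df2 <- data.frame(ecuTime = c(1, 3.2, 3.4, 6),
                      c = as.factor(c("car", "car", "bike", "car")),
                      d = c(2, 3, 5, 6))

    # Hypothetical helper: step-interpolate the numeric columns of one data frame
    # onto `grid`, and attach the other columns only at their original times.
    resample_one <- function(df, grid, key = "ecuTime") {
      is_num <- sapply(df, is.numeric) & names(df) != key
      out <- data.frame(grid)
      names(out) <- key
      for (v in names(df)[is_num]) {
        out[[v]] <- approx(df[[key]], df[[v]], xout = grid, method = "constant")$y
      }
      # Integer key in tenths of a time unit (assumption: all times are multiples of 0.1)
      out$key10 <- as.integer(round(out[[key]] * 10))
      fac <- df[, !is_num, drop = FALSE]   # key column plus the non-numeric columns
      fac$key10 <- as.integer(round(fac[[key]] * 10))
      fac[[key]] <- NULL
      merge(out, fac, by = "key10", all.x = TRUE)
    }

    res1 <- resample_one(df1, time)
    res2 <- resample_one(df2, time)

    # Combine both resampled frames on the integer key, then drop the key
    resampled <- merge(res1, res2[, names(res2) != "ecuTime"], by = "key10")
    resampled$key10 <- NULL
    str(resampled)
    ```

    Each numeric column is interpolated against its own source data frame, so no NA values introduced by the merge enter approx(), and the na.rm question does not arise. The factor columns a and c are non-NA only in the rows that match an original observation.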