Search code examples
rlinear-regression

Converting a grouped continuous variable into rows in R


I have a data frame with these values dummy vales and I want to do lm regression on them. One of the variables is a grouped continuous variable as shown below

df <- data.frame("y" = c(10, 11, 12, 13, 14),
                 "x" = as.factor(c("100-102", "103-105", "106-108", "109-111", "112-114")))

I want to regress y~x, One way is to replace the x factors with their mean numeric values. This is easily done using regular expression.

Another way is to create the additional rows and expand your dataset so it looks like this

data.frame("y" = c(10, 10, 10, 11, 11, 11......),
           "x" = c(100, 101, 102, 103, 104, 105......))

Is there a function that will do this?

I'm thinking of first creating additional variables like x1, x2, x3 and then use reshape2 package to convert the x columns to rows.


Solution

  • A data.table solution. This should be really fast on large data.frame's as well.

    require(data.table)
    dt <- data.table(df, key="y")
    dt[, list(x=seq(sub("-.*$", "", x), sub(".*-", "", x))),by=y]
    

    If you have more columns and you don't want each combinations while splitting by column x, then this is the code to use:

    require(data.table)
    dt <- data.table(df)
    # get all column names except "x"
    key.cols <- setdiff(names(df), "x") 
    # set the data.table columns to key.cols
    setkeyv(dt, key.cols)
    dt.out <- dt[, list(x=seq(sub("-.*$", "", x), sub(".*-", "", x))), by = key.cols]
    

    This should give you what you expect.