
How to add dummy variables in R for a large data set


I have a large data set with two columns, ID and Property. Several rows may share the same ID, which means that one ID can have many different properties (Property is a categorical variable). I want to add dummy variables for Property and end up with a data frame that has one row per distinct ID and uses 1/0 to indicate whether that ID has each property. The original data has 2 million rows and 10,000 distinct properties. So, ideally, I would reduce the number of rows by combining identical IDs and add one dummy variable column per property.

R crashes when I use the following code:

for (t in unique(df$property)) {
  df3[paste("property", t, sep = "")] <- ifelse(df$property == t, 1, 0)
}

So I am wondering: what is the most efficient way to add dummy variable columns for a large data set in R?


Solution

  • We can just use table, which returns a count for each ID/property pair (so a cell can exceed 1 when a pair is repeated):

    as.data.frame.matrix(table(df1))
    #  A B C D
    #1 1 1 0 0
    #3 0 0 1 0
    #4 0 0 0 1
    #5 0 0 0 2
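
    If strict 1/0 indicators are wanted rather than counts, one option is to drop duplicated ID/property rows before tabulating. This is only a minimal sketch on the same toy df1 (re-created here so the snippet runs on its own); the name dummies is illustrative and not part of the original answer.

    ```r
    # Toy data matching df1 from the "data" section of this answer
    df1 <- data.frame(a = c(1L, 1L, 3L, 4L, 5L, 5L),
                      b = c("A", "B", "C", "D", "D", "D"))

    # unique() keeps each ID/property pair once, so every count is 0 or 1
    dummies <- as.data.frame.matrix(table(unique(df1)))
    dummies
    #   A B C D
    # 1 1 1 0 0
    # 3 0 0 1 0
    # 4 0 0 0 1
    # 5 0 0 0 1
    ```

    The row names hold the distinct IDs, and ID 5 now shows 1 instead of 2 under D.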
    

    Or a more efficient approach would be dcast from data.table:

    library(data.table)
    dcast(setDT(df1), a ~ b, value.var = "a", fun.aggregate = length)
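
    Like table, dcast with length returns counts, so repeated ID/property pairs give values above 1. A possible variant that yields 0/1 indicators is to de-duplicate the pairs first. This is a hedged sketch that assumes the data.table package is installed; dt and wide are illustrative names, and df1 is re-created so the snippet is self-contained.

    ```r
    library(data.table)

    # Toy data matching df1 from the "data" section of this answer
    df1 <- data.frame(a = c(1L, 1L, 3L, 4L, 5L, 5L),
                      b = c("A", "B", "C", "D", "D", "D"))

    # Keep each ID/property pair once, then reshape to one row per ID;
    # with length as the aggregate, absent pairs are filled with 0
    dt <- unique(as.data.table(df1))
    wide <- dcast(dt, a ~ b, value.var = "b", fun.aggregate = length)
    ```

    Here the ID stays in column a, followed by one indicator column per property.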
    

    data

    df1 <- structure(list(a = c(1L, 1L, 3L, 4L, 5L, 5L), b = c("A", "B", 
    "C", "D", "D", "D")), .Names = c("a", "b"), row.names = c("1", 
    "2", "3", "4", "5", "6"), class = "data.frame")