I have a large data set with two columns: ID and Property. Several rows may share the same ID, meaning that one ID can have many different properties (Property is a categorical variable). I want to add dummy variables for Property and end up with a data frame that has one row per distinct ID and uses 1/0 to indicate whether that ID has each property. The original data has 2 million rows and 10,000 distinct properties. So, ideally, I will reduce the number of rows by combining identical IDs and add one dummy variable column per property.
R crashes when I use the following code:
for (t in unique(df$property)) {
  df3[paste("property", t, sep = "")] <- ifelse(df$property == t, 1, 0)
}
So I am wondering: what is the most efficient way to add dummy variable columns for a large data set in R?
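We can just use table, which cross-tabulates the two columns. Note that it returns counts, so an ID with a repeated property gets a value greater than 1 (see the last row of the output):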
as.data.frame.matrix(table(df1))
# A B C D
#1 1 1 0 0
#3 0 0 1 0
#4 0 0 0 1
#5 0 0 0 2
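Since the question asks for strict 1/0 indicators rather than counts, one possible adjustment is to drop duplicated ID/property pairs before tabulating. This is only a sketch on the small example data (re-created here with `data.frame()` so the snippet runs on its own); the name `dummies` is just an illustrative choice.

```r
# Example data, same values as df1 in this answer
df1 <- data.frame(a = c(1L, 1L, 3L, 4L, 5L, 5L),
                  b = c("A", "B", "C", "D", "D", "D"),
                  stringsAsFactors = FALSE)

# unique() removes repeated (a, b) pairs, so every cell of the
# resulting table can only be 0 or 1
dummies <- as.data.frame.matrix(table(unique(df1)))
dummies
```

With the duplicates removed, the row for ID 5 should show a 1 under D instead of 2.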
Or a more efficient approach would be dcast from data.table:
library(data.table)
dcast(setDT(df1), a ~ b, value.var = "a", fun.aggregate = length)
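With 2 million rows and 10,000 properties, a dense result may still be too large to hold in memory, whichever function builds it. As a hedged alternative that is not part of the approaches above, the indicators could be kept in a sparse matrix using the Matrix package (shipped with standard R installations). The sketch below assumes a sparse matrix is acceptable downstream; `ids`, `props`, and `sp` are illustrative names.

```r
library(Matrix)

# Example data, same values as df1 in this answer
df1 <- data.frame(a = c(1L, 1L, 3L, 4L, 5L, 5L),
                  b = c("A", "B", "C", "D", "D", "D"),
                  stringsAsFactors = FALSE)

# Keep each (ID, property) pair once so the entries stay 0/1
u <- unique(df1)
ids   <- factor(u$a)
props <- factor(u$b)

# One row per ID, one column per property; only the 1s are stored
sp <- sparseMatrix(i = as.integer(ids),
                   j = as.integer(props),
                   x = 1,
                   dims = c(nlevels(ids), nlevels(props)),
                   dimnames = list(levels(ids), levels(props)))
sp
```

The result stores only the non-zero cells, so its size grows with the number of distinct ID/property pairs rather than with IDs times properties. Whether this fits depends on what the dummy variables are used for afterwards, since not every modelling function accepts sparse input.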
Data used in the examples above:
df1 <- structure(list(a = c(1L, 1L, 3L, 4L, 5L, 5L), b = c("A", "B",
"C", "D", "D", "D")), .Names = c("a", "b"), row.names = c("1",
"2", "3", "4", "5", "6"), class = "data.frame")