I am trying to replace an inefficient nested for loop, which will not finish on a large dataset, with the apply function.
unique <- cbind.data.frame(c(1,2,3))
colnames(unique) <- "note"
ptSeenSub <- rbind.data.frame(c(1,"a"), c(1,"b"), c(2,"a"), c(2,"d"), c(3,"e"), c(3,"f"))
colnames(ptSeenSub) <- c("PARENT_EVENT_ID", "USER_NAME")
uniqueRow <- nrow(unique)
ptSeenSubRow <- nrow(ptSeenSub)
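for (note in 1:uniqueRow)
{
  for (row in 1:ptSeenSubRow)
  {
    if (ptSeenSub$PARENT_EVENT_ID[row] == unique$note[note])
    {
      # first matching row is the attending, the row after it is the resident
      unique$attending_name[note] <- ptSeenSub$USER_NAME[row]
      unique$resident_name[note] <- ptSeenSub$USER_NAME[row + 1]
      # stop at the first match so the second matching row does not overwrite both names
      break
    }
  }
}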
I would like the results to be similar to this dataframe:
results <- rbind.data.frame(c(1, "a", "b"), c(2, "a", "d"), c(3,"e", "f"))
colnames(results) <- c("note", "attending_name", "resident_name")
The loop will be running over millions of rows and will not finish. How can I vectorize this so that it finishes on large data sets? Any advice is greatly appreciated.
Sounds like you are trying to reshape your data into wide format. I find that dplyr and tidyr provide nice tools to accomplish this.
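Define the data:
library(tidyr)
library(dplyr)
ptSeenSub <- rbind.data.frame(c(1,"a"), c(1,"b"), c(2,"a"), c(2,"d"), c(3,"e"), c(3,"f"))
# name the columns so that group_by() and spread() below can refer to them
colnames(ptSeenSub) <- c("PARENT_EVENT_ID", "USER_NAME")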
Reshape to wide format:
result <- ptSeenSub %>%
  group_by(PARENT_EVENT_ID) %>%
  mutate(k = row_number()) %>%
  spread(k, USER_NAME)
You can then change names if you wish:
names(result) <- c("note", "attending_name", "resident_name")
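As a possible variant, the same reshape can be sketched with pivot_wider(), which tidyr (1.0.0 and later) offers as the successor to spread(). This is only a sketch under one assumption that the question does not state outright: within each PARENT_EVENT_ID the first row is the attending and the second row is the resident. The names role_names, role and result_wide are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Same toy data as above, built with explicit column names
ptSeenSub <- data.frame(
  PARENT_EVENT_ID = c(1, 1, 2, 2, 3, 3),
  USER_NAME = c("a", "b", "a", "d", "e", "f"),
  stringsAsFactors = FALSE
)

# Assumption: row 1 of each group is the attending, row 2 is the resident
role_names <- c("attending_name", "resident_name")

result_wide <- ptSeenSub %>%
  group_by(PARENT_EVENT_ID) %>%
  # label each row by its position within the group
  mutate(role = role_names[row_number()]) %>%
  ungroup() %>%
  # one output column per role, filled with the user names
  pivot_wider(names_from = role, values_from = USER_NAME) %>%
  rename(note = PARENT_EVENT_ID)
```

Labelling the rows before reshaping means the final column names come out of the pipeline directly, so no separate names() assignment is needed. If some PARENT_EVENT_ID has more than two rows, the extra rows get an NA role, so it is worth checking the group sizes first.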