I have two large datasets: the original data, and a copy that has had its grouping column and any duplicate rows removed. After a lot of data wrangling and machine learning on the copy, I need to reattach the grouping column from the original data to it. I have tried to replicate this in an example:
#Using the iris data set, adding duplicates as my real dataset involves duplicates
#repeat row 1, 20 times
iris1 <- rbind(iris, iris[rep(1, 20), ])
#new dataset with dupes removed and species column removed
iris_rm <- subset(iris1, select = -c(Species) )
iris_rm <- iris_rm[!duplicated(iris_rm), ]
# Data analysis occurring here...
#
#
#
#
#I then want to semi_join the two datasets, without the Species column,
# i.e. returning all the rows that are from iris1, that match iris_rm (ignoring the Species column)
library(dplyr)
new <- semi_join(iris_rm, iris1[,-5], by = NULL)
#How do I then reattach the Species column to the new dataframe?
#I have tried this, however as there are differing row lengths, it won't work
new['Species2']= iris1['Species']
Ideally, the semi_join would ignore the Species column without actually removing it from the new dataframe. Note that the duplicated rows in the dataset are not true duplicates (i.e. the Species column could differ even though the rest of the columns are the same). Hope this makes sense!
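Maybe you can use inner_join instead of semi_join. Unlike semi_join, it keeps the columns of the second table, so Species comes back. Because iris1 contains repeated rows, the join can also create duplicate rows, so you need to follow it with distinct().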
new <- inner_join(iris_rm, iris1) %>% distinct()
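To make the join columns explicit, the same idea can be sketched as below. This is only a sketch under a few assumptions: `key_cols` and `kept` are names introduced here for illustration, and the key is assumed to be every column of iris1 except Species. If the analysis step adds columns to iris_rm, naming the key columns this way should keep the join from trying to match on them.

```r
library(dplyr)

# Rebuild the example data from the question.
iris1 <- rbind(iris, iris[rep(1, 20), ])
iris_rm <- iris1[, names(iris1) != "Species"]
iris_rm <- iris_rm[!duplicated(iris_rm), ]

# Assumption: the join key is every column of iris1 except the grouping column.
key_cols <- setdiff(names(iris1), "Species")

# inner_join() keeps the non-key columns of iris1, so Species is reattached;
# distinct() then drops the repeats caused by the duplicated rows in iris1.
new <- inner_join(iris_rm, iris1, by = key_cols) %>% distinct()

# Alternative: filter iris1 itself, so Species is never dropped in the first place.
kept <- semi_join(iris1, iris_rm, by = key_cols) %>% distinct()
```

If one combination of measurements occurs with more than one Species in the original data, both versions return one row per Species for that combination, which seems to be what the "not true duplicates" note in the question calls for.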