I have a dataset with one column of names and one column describing what each person did during the day. Using R, I am trying to work out who met with whom that day. I created a vector of the names in the dataset and used grepl in a loop to find where those names appear in the activity column.
name <- c("Dupont","Dupuy","Smith")
activity <- c("On that day, he had lunch with Dupuy in London.",
"She had lunch with Dupont and then went to Brighton to meet Smith.",
"Smith remembers that he was tired on that day.")
met_with <- c("Dupont","Dupuy","Smith")
df <- data.frame(name, activity, met_with = NA)
# Each pass overwrites met_with, so only the last matching name survives
for (i in seq_along(met_with)) {
  df$met_with <- ifelse(grepl(met_with[i], df$activity), met_with[i], df$met_with)
}
However, this solution is unsatisfactory for two reasons. First, I can't extract more than one name when a person met several other people (e.g. Dupuy in my example). Second, I cannot tell R to skip a person's own name when it is used instead of a pronoun in the activity column (e.g. Smith).
Ideally, I would like the df to look like:
name    activity                                         met_with
Dupont  On that day, he had lunch with Dupuy in London.  Dupuy
Dupuy   She had lunch with Dupont and then (...).        Dupont Smith
Smith   Smith remembers that he was tired on that day.   NA
I am cleaning up the strings to construct an edge list and node list to conduct network analysis later on.
Thank you
Same logic as @Gki, but using stringr functions and mapply instead of a loop.
library(stringr)
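# One alternation of all names, each wrapped in word boundaries
pat <- str_c('\\b', df$name, '\\b', collapse = '|')
# Extract every name per activity, then drop the row's own name
df$met_with <- mapply(function(x, y) str_c(setdiff(x, y), collapse = ' '),
                      str_extract_all(df$activity, pat), df$name)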
df
# name activity
#1 Dupont On that day, he had lunch with Dupuy in London.
#2 Dupuy She had lunch with Dupont and then went to Brighton to meet Smith.
#3 Smith Smith remembers that he was tired on that day.
# met_with
#1 Dupuy
#2 Dupont Smith
#3
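Note that row 3 comes back as an empty string rather than the NA asked for in the question. If that matters, or if you want to go straight to an edge list, here is a rough base R sketch of the same idea. The names found, others and edges are my own, not from the question, and it assumes the names contain no regex metacharacters (they would need escaping otherwise).

```r
name <- c("Dupont", "Dupuy", "Smith")
activity <- c("On that day, he had lunch with Dupuy in London.",
              "She had lunch with Dupont and then went to Brighton to meet Smith.",
              "Smith remembers that he was tired on that day.")
df <- data.frame(name, activity, stringsAsFactors = FALSE)

# Same pattern idea as above: any known name as a whole word
pat <- paste0("\\b(", paste(df$name, collapse = "|"), ")\\b")

# All names mentioned in each activity (one character vector per row)
found <- regmatches(df$activity, gregexpr(pat, df$activity, perl = TRUE))

# Drop each person's own name from their own row
others <- Map(setdiff, found, df$name)

# Collapse to one string per row, using NA when nobody else was mentioned
df$met_with <- vapply(
  others,
  function(x) if (length(x) > 0) paste(x, collapse = " ") else NA_character_,
  character(1)
)

# Edge list (hypothetical column names from/to): one row per (person, person met)
edges <- data.frame(
  from = rep(df$name, lengths(others)),
  to   = unlist(others, use.names = FALSE),
  stringsAsFactors = FALSE
)
```

Keeping others as a list avoids having to split the collapsed met_with strings again when building the edge list. Rows with no match simply contribute no edges.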