Let's say I have a document with some text, like this, from SO:
doc <- 'Questions with similar titles have frequently been downvoted and/or closed. Consider using a title that more accurately describes your question.'
I can then build a data frame with one row per word:
library(stringi)
dfall <- data.frame(words = unlist(stri_extract_all_words(stri_trans_tolower(doc))))
I'd like to add an id column holding each word's unique ID. To get the IDs, I first remove the duplicates:
library(dplyr)
uniquedf <- distinct(data.frame(words = unlist(stri_extract_all_words(stri_trans_tolower(doc)))))
I'm struggling with how to match the rows of the two data frames so that the row index from uniquedf becomes a new column value in dfall:
alldf <- dfall %>% mutate(id = which(uniquedf$words == words))
A dplyr approach like this doesn't work.
Is there a more efficient way to do this?
To give an even simpler example to show the expected output, I'd like a dataframe that looks like this:
  words id
1    to  1
2   row  2
3   zip  3
4   zip  3
Here my starting word vector is doc <- c('to', 'row', 'zip', 'zip') or doc <- c('to row zip zip'). The id column assigns a unique ID to each unique word.
A cheap way is to use sapply.
Data (the third word is changed to a second "with" so that one ID repeats):
doc <- 'Questions with with titles have frequently been downvoted and/or closed. Consider using a title that more accurately describes your question.'
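Code (dfall and uniquedf are rebuilt from this doc exactly as in the question):
# For each row of dfall, look up the position of its word in uniquedf
alldf <- cbind(dfall, sapply(seq_len(nrow(dfall)), function(x) which(uniquedf$words == dfall$words[x])))
colnames(alldf) <- c("words", "id")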
> alldf
        words id
1   questions  1
2        with  2
3        with  2
4      titles  3
5        have  4
6  frequently  5
7        been  6
8   downvoted  7
9         and  8
10         or  9
11     closed 10
12   consider 11
13      using 12
14          a 13
15      title 14
16       that 15
17       more 16
18 accurately 17
19  describes 18
20       your 19
21   question 20
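If the sapply loop gets slow on longer documents, a vectorized lookup with base R's match() may be worth trying. The mutate attempt in the question likely fails because which(uniquedf$words == words) compares two vectors of different lengths element by element instead of looking up each word, so it does not return one index per row. Below is a minimal sketch using the small example vector from the question; the object names words and dfall are only for illustration, and I have not benchmarked it against the sapply version.

```r
# Small example vector from the question
words <- c("to", "row", "zip", "zip")

# One row per word, mirroring dfall in the question
dfall <- data.frame(words = words, stringsAsFactors = FALSE)

# match() returns, for each word, the position of its first occurrence
# in the de-duplicated vector; that position serves as the unique id
dfall$id <- match(dfall$words, unique(dfall$words))

print(dfall)
#   words id
# 1    to  1
# 2   row  2
# 3   zip  3
# 4   zip  3
```

The same expression should also work inside dplyr, for example mutate(id = match(words, unique(words))), which avoids building uniquedf as a separate data frame.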