I am trying to run this code:
library("spacyr")
library("dplyr", warn.conflicts = FALSE)
mytext <- data.frame(text = c("test text", "section 2 sending"),
id = c(32,41))
df2 <- tidyr::separate_rows(mytext, text)
df3 <- data.frame(text = df2$text, id = df2$id)
dflemma <- spacy_parse(structure(df3$text, names = df3$id),
lemma = TRUE, pos = FALSE) %>%
mutate(id = doc_id) %>%
group_by(id) %>%
summarize(body = paste(lemma, collapse = " "))
The expected output has one row per id, with the lemmatized words of each id merged back into a single string separated by spaces. Here is the expected output:
data.frame(text = c("test text", "section 2 send"),
id = c(32,41))
However, the code produces this error:
Error in process_document(x, multithread) : Docnames are duplicated.
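You get this error because tidyr::separate_rows splits each text phrase into one row per word, so the same id appears on several rows and spacy_parse receives duplicated document names. You shouldn't do that; pass the whole phrases instead. Consider the following code: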
mytext <- data.frame(text = c("test text", "section 2 sending"), id = c(32,41))
dflemma <-
spacy_parse(structure(mytext$text, names = mytext$id), lemma = TRUE, pos = FALSE) %>%
group_by(id = doc_id) %>%
summarise(text = paste(lemma, collapse = " "))
Output
> dflemma
# A tibble: 2 x 2
  id    text
  <chr> <chr>
1 32    test text
2 41    section 2 send
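To see where the duplicated document names come from, it can help to look at the names the original code ends up passing to spacy_parse. This is only a sketch in base R: strsplit on single spaces is used here as a stand-in for tidyr::separate_rows (an assumption made so the example runs without extra packages), and doc_names is a hypothetical variable name.

```r
# Assumption: splitting on single spaces mimics tidyr::separate_rows
# closely enough for this small example.
mytext <- data.frame(text = c("test text", "section 2 sending"), id = c(32, 41))

# One character vector of words per phrase.
words <- strsplit(as.character(mytext$text), " ", fixed = TRUE)

# Each id is repeated once per word, which is what the separated data frame holds.
doc_names <- rep(mytext$id, lengths(words))

doc_names                  # ids repeated per word
anyDuplicated(doc_names)   # non-zero means at least one name is repeated
```

Any non-zero result from anyDuplicated means the names are not unique, which is the condition the error message complains about.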
Update
If you have to do the separation, then you need to further modify your id column to ensure that each value in it is unique. Later you can change those ids back at the group_by stage. Consider the following code:
mytext <- data.frame(text = c("test text", "section 2 sending"), id = c(32,41))
df2 <- tidyr::separate_rows(mytext, text) %>%
  group_by(id) %>%
  mutate(id = paste0(id, "-", seq_len(n())))
dflemma <-
spacy_parse(structure(df2$text, names = df2$id), lemma = TRUE, pos = FALSE) %>%
group_by(id = sub("(.+)-(.+)", "\\1", doc_id)) %>%
summarise(text = paste(lemma, collapse = " "))
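The id round trip used above (append a per-word counter, then strip it again) can be checked on its own without spaCy. The following is a minimal sketch in base R; ids, unique_ids and restored are hypothetical names, and ave() is used here in place of the dplyr group_by/mutate step.

```r
# Ids as they look after the separation: one entry per word.
ids <- c(32, 32, 41, 41, 41)

# Number the entries within each id and append that counter,
# so every document name becomes unique.
unique_ids <- paste0(ids, "-", ave(ids, ids, FUN = seq_along))
unique_ids

# Strip the counter again with the same pattern as in the group_by step.
restored <- sub("(.+)-(.+)", "\\1", unique_ids)
restored
```

Note that the restored ids are character strings, which matches the <chr> id column in the output shown earlier; wrap them in as.numeric() if numeric ids are needed.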