
From long to wide format with the same duplicates


I am trying to run this code:

library("spacyr")
library("dplyr", warn.conflicts = FALSE)

mytext <- data.frame(text = c("test text", "section 2 sending"), 
                     id = c(32,41))
df2 <- tidyr::separate_rows(mytext, text)

df3 <- data.frame(text = df2$text, id = df2$id)

dflemma <- spacy_parse(structure(df3$text, names = df3$id),
                       lemma = TRUE, pos = FALSE)  %>%
    mutate(id = doc_id) %>%
    group_by(id) %>%
    summarize(body = paste(lemma, collapse = " "))

The expected output goes from long back to wide format: one row per id, with the lemmatized words for that id joined by a space. Here is the expected output:

data.frame(text = c("test text", "section 2 send"), 
                     id = c(32,41))

However, the command produces this error:

Error in process_document(x, multithread) : Docnames are duplicated.

Solution

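  • You get this error because separate_rows() splits each text phrase into one word per row, so every word from the same phrase carries the same id. Those ids become the document names passed to spacy_parse(), which requires them to be unique. You shouldn't do the split: spacy_parse() tokenizes the text itself. Consider the following code: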

    mytext <- data.frame(text = c("test text", "section 2 sending"), id = c(32,41))
    dflemma <- 
      spacy_parse(structure(mytext$text, names = mytext$id), lemma = TRUE, pos = FALSE) %>% 
      group_by(id = doc_id) %>% 
      summarise(text = paste(lemma, collapse = " "))
    

    Output

    > dflemma
    # A tibble: 2 x 2
      id    text          
      <chr> <chr>         
    1 32    test text     
    2 41    section 2 send
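
To see where the duplicated names come from, here is a minimal base R sketch that does not call spaCy at all. It assumes that splitting on a single space gives the same rows as separate_rows() for this sample data; the names long, dup_long and dup_wide are made up for illustration.

```r
# Base R only: reproduce the duplicated document names without calling spaCy.
mytext <- data.frame(text = c("test text", "section 2 sending"),
                     id = c(32, 41))

# Splitting on a space mimics one word per row; each word keeps its parent id.
words <- strsplit(mytext$text, " ", fixed = TRUE)
long <- data.frame(text = unlist(words),
                   id = rep(mytext$id, lengths(words)))

# Names built from the long table repeat (32, 32, 41, 41, 41).
dup_long <- anyDuplicated(as.character(long$id)) > 0

# Names built from the original table do not repeat.
dup_wide <- anyDuplicated(as.character(mytext$id)) > 0
```

Running a check like anyDuplicated() on the names before calling spacy_parse() is a cheap way to catch this problem early.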
    

    Update

    If you have to do the separation, then you need to make the document names unique, for example by appending a running counter to each id. Later you can strip that suffix to recover the original ids at the group_by stage. Consider the following code:

    mytext <- data.frame(text = c("test text", "section 2 sending"), id = c(32,41))
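    df2 <- tidyr::separate_rows(mytext, text) %>%
      group_by(id) %>%
      # store the unique key in a new column: recent dplyr versions refuse to
      # overwrite a grouping column inside mutate()
      mutate(uid = paste0(id, "-", row_number())) %>%
      ungroup()
    dflemma <- 
      spacy_parse(structure(df2$text, names = df2$uid), lemma = TRUE, pos = FALSE) %>% 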
      group_by(id = sub("(.+)-(.+)", "\\1", doc_id)) %>% 
      summarise(text = paste(lemma, collapse = " "))
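
The suffix round trip in the update can be checked without spaCy. The following is only a sketch in base R, under the assumption that the ids themselves never end in a hyphen followed by digits; the column names doc_id and orig_id and the object wide are illustrative, and no lemmatization happens here, so "sending" stays as it is.

```r
# Base R sketch of the "add a suffix, then strip it" round trip (no lemmatization).
mytext <- data.frame(text = c("test text", "section 2 sending"),
                     id = c(32, 41))
words <- strsplit(mytext$text, " ", fixed = TRUE)
long <- data.frame(text = unlist(words),
                   id = rep(mytext$id, lengths(words)))

# Append a per-id counter so every name is unique: 32-1, 32-2, 41-1, 41-2, 41-3.
counter <- ave(seq_along(long$id), long$id, FUN = seq_along)
long$doc_id <- paste0(long$id, "-", counter)

# Strip the trailing "-<digits>" to recover the original id.
long$orig_id <- sub("-[0-9]+$", "", long$doc_id)

# Collapse the words back to one row per original id.
wide <- aggregate(text ~ orig_id, data = long, FUN = paste, collapse = " ")
```

Anchoring the pattern at the end of the string keeps the stripping safe even if an id contains a hyphen of its own.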