
How to feed a tibble to spacyr?


Consider this simple example:

library(tibble)
library(spacyr)

bogustib <- tibble(doc_id = c(1,2,3),
                   text = c('bug', 'one love', '838383838'))

# A tibble: 3 x 2
  doc_id text     
   <dbl> <chr>    
1      1 bug      
2      2 one love 
3      3 838383838

This tibble is called bogustib because I know spacyr will fail on row 3.

> spacy_parse('838383838', lemma = FALSE, entity = TRUE, nounphrase = TRUE)
Error in `$<-.data.frame`(`*tmp*`, "doc_id", value = "text1") : 
  replacement has 1 row, data has 0

So, naturally, feeding the tibble to spacyr fails as well:

> spacy_parse(bogustib, lemma = FALSE, entity = TRUE, nounphrase = TRUE)
Error in `$<-.data.frame`(`*tmp*`, "doc_id", value = "3") : 
  replacement has 1 row, data has 0

I think I can avoid this issue by calling spacy_parse row by row.
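
The row-by-row idea could look roughly like the sketch below. It is only an illustration under stated assumptions: parse_row_by_row and toy_parser are hypothetical names, and toy_parser is a stand-in that fails on digit-only text so the example runs without spaCy installed. With spacyr loaded, the parser argument would instead be something like function(txt) spacy_parse(txt, lemma = FALSE, entity = TRUE, nounphrase = TRUE).

```r
# Hypothetical helper: parse each row separately so that one failing
# document does not abort the whole run.
parse_row_by_row <- function(df, parse_fun) {
  results <- lapply(seq_len(nrow(df)), function(i) {
    tryCatch({
      parsed <- parse_fun(df$text[i])
      # Restore the original document id on every token row.
      parsed$doc_id <- rep(as.character(df$doc_id[i]), nrow(parsed))
      parsed
    }, error = function(e) {
      message("skipping doc_id ", df$doc_id[i], ": ", conditionMessage(e))
      NULL
    })
  })
  # Drop the failed rows, then stack what is left.
  results <- Filter(Negate(is.null), results)
  do.call(rbind, results)
}

# Stand-in parser (an assumption, NOT spacyr): errors on digit-only text.
toy_parser <- function(txt) {
  if (grepl("^[0-9]+$", txt)) stop("replacement has 1 row, data has 0")
  tokens <- strsplit(txt, " ", fixed = TRUE)[[1]]
  data.frame(doc_id = "text1",
             token_id = seq_along(tokens),
             token = tokens,
             stringsAsFactors = FALSE)
}

docs <- data.frame(doc_id = c(1, 2, 3),
                   text = c("bug", "one love", "838383838"),
                   stringsAsFactors = FALSE)

parsed <- parse_row_by_row(docs, toy_parser)
print(parsed)
```

Each call handles a single document, so any speed-up from parsing many documents in one call is lost.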

However, this looks inefficient, and I would like to use spacyr's multithread argument to speed up the computation over my large tibble.

Is there any solution here? Thanks!


Solution

  • Actually, I cannot reproduce this in my environment. There, the output looks like this:

    library(tidyverse)
    library(spacyr)
    
    bogustib <- tibble(doc_id = c(1,2,3),
                       text = c('bug', 'one love', '838383838'))
    
    spacy_parse(bogustib)
    
    spacy_parse('838383838', lemma = FALSE, entity = TRUE, nounphrase = TRUE)
    ## No noun phrase found in documents.
    ##   doc_id sentence_id token_id     token pos     entity
    ## 1  text1           1        1 838383838 NUM CARDINAL_B
    
    

    To get this result, I used the latest master on GitHub. However, I was able to reproduce your error with the CRAN version of spacyr. I am sure I fixed this bug a while ago, but the fix does not seem to have reached the CRAN version yet. We will try to update CRAN as soon as possible.

    In the meantime, you can install the development version from GitHub:

    devtools::install_github('quanteda/spacyr')
    

    Or download the repository as a zip file, unzip it, and run:

    devtools::install('******')
    

    Here ****** is the path to the unzipped repository.
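
After reinstalling, it can help to confirm which build R is actually loading. The sketch below is a minimal check using only base R utilities; describe_install is a hypothetical helper name. It assumes that packages installed through devtools or remotes record a RemoteType field in their DESCRIPTION, while CRAN builds record a Repository field.

```r
# Hypothetical helper: report the installed version of a package and,
# where the DESCRIPTION file records it, where the package came from.
describe_install <- function(pkg) {
  desc <- utils::packageDescription(pkg)
  remote <- desc[["RemoteType"]]   # set by devtools/remotes installs
  repo   <- desc[["Repository"]]   # set for CRAN builds
  source <- if (!is.null(remote)) {
    remote
  } else if (!is.null(repo)) {
    repo
  } else {
    "unknown"
  }
  list(version = as.character(utils::packageVersion(pkg)),
       source  = source)
}

# Demonstrated on a package that ships with R, so the example always runs.
info <- describe_install("stats")
print(info)

# The same check for spacyr, only if it is installed.
if (requireNamespace("spacyr", quietly = TRUE)) {
  print(describe_install("spacyr"))
}
```

If the reported source is still the CRAN repository after installing from GitHub, restart the R session and check again before rerunning spacy_parse.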