How to feed a tibble to spacyr?

Consider this simple example:

bogustib <- tibble(doc_id = c(1,2,3),
                   text = c('bug', 'one love', '838383838'))

# A tibble: 3 x 2
  doc_id text     
   <dbl> <chr>    
1      1 bug      
2      2 one love 
3      3 838383838

This tibble is called bogustib because I know spacyr will fail on row 3.

> spacy_parse('838383838', lemma = FALSE, entity = TRUE, nounphrase = TRUE)
Error in `$<-.data.frame`(`*tmp*`, "doc_id", value = "text1") : 
  replacement has 1 row, data has 0

So, naturally, feeding the tibble to spacyr fails as well:

spacy_parse(bogustib, lemma = FALSE, entity = TRUE, nounphrase = TRUE)
Error in `$<-.data.frame`(`*tmp*`, "doc_id", value = "3") : 
  replacement has 1 row, data has 0

I think I can avoid this issue by calling spacy_parse row by row.

However, that looks inefficient, and I would like to use the multithread argument of spacyr to speed up the computation over my large tibble.

Is there any solution here? Thanks!

Solution

Actually, this error does not occur in my environment. There, the output looks like this:

library(tidyverse)
library(spacyr)

bogustib <- tibble(doc_id = c(1,2,3),
                   text = c('bug', 'one love', '838383838'))

spacy_parse(bogustib)

spacy_parse('838383838', lemma = FALSE, entity = TRUE, nounphrase = TRUE)
## No noun phrase found in documents.
##   doc_id sentence_id token_id     token pos     entity
## 1  text1           1        1 838383838 NUM CARDINAL_B

To get this result, I used the latest master branch on GitHub. However, I was able to reproduce your error with the CRAN version of spacyr. I'm sure I fixed the bug a while ago, but the fix does not seem to be reflected in the CRAN version yet. We will try to update the CRAN release as soon as possible.

In the meantime, you can install the development version from GitHub:

devtools::install_github('quanteda/spacyr')

Or download the repository as a zip file, unzip it, and run:

devtools::install('******')

Here, ****** is the path to the unzipped repository.
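
If you cannot upgrade right away, the row-by-row fallback mentioned in the question can at least be made to skip failing rows instead of aborting the whole run. The following is only a minimal sketch in base R: `parse_rows_safely` and its `parse_fn` argument are hypothetical names introduced for illustration, not part of spacyr, and it assumes the tibble has `doc_id` and `text` columns as in the example. It does not address the speed concern, since each row is still parsed separately.

```r
# Hypothetical helper (not part of spacyr): apply a parser to each row and
# catch errors per row, so that one bad document does not stop the others.
# `parse_fn` is assumed to take a single character string and return a
# data frame, e.g.
#   function(txt) spacyr::spacy_parse(txt, lemma = FALSE, entity = TRUE,
#                                     nounphrase = TRUE)
parse_rows_safely <- function(df, parse_fn) {
  results <- lapply(seq_len(nrow(df)), function(i) {
    tryCatch(
      {
        parsed <- parse_fn(df$text[[i]])
        # Restore the original document id instead of the default "text1".
        parsed$doc_id <- as.character(df$doc_id[[i]])
        parsed
      },
      error = function(e) NULL  # NULL marks a row that could not be parsed
    )
  })
  failed <- vapply(results, is.null, logical(1))
  list(
    parsed     = results[!failed],   # list of per-document data frames
    failed_ids = df$doc_id[failed]   # ids of the rows that raised an error
  )
}
```

The parsed pieces are returned as a list rather than stacked with `rbind`, because the columns can differ between documents (the output above, for instance, has no noun phrase columns). `dplyr::bind_rows(out$parsed)` would combine them and fill the missing columns with `NA`.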