
Error when importing a tm VCorpus into a quanteda corpus


This code snippet worked fine until I updated R (3.6.3) and RStudio (1.2.5042) yesterday, though it is not obvious to me that the update is the source of the problem.

In a nutshell, I convert 91 PDF files into a volatile corpus named Vcorp and confirm its class as follows:

> Vcorp <- VCorpus(VectorSource(citiesText)) 
> class(Vcorp)
[1] "VCorpus" "Corpus" 

Then I attempt to import this tm VCorpus into quanteda, but I keep getting an error message that I did not get before (e.g., the day before the update).

> data(Vcorp, package = "tm")   
> citiesCorpus <- corpus(Vcorp)
Error in data.frame(..., check.names = FALSE) : 
  arguments imply differing number of rows: 8714, 91 

Any suggestions? Thank you.


Solution

  • It is impossible to know the exact problem without (a) version information for your packages and (b) a reproducible example.

    Why use tm at all? You could have created a quanteda corpus directly as:

    corpus(citiesText)
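
    As a minimal sketch of that direct route: corpus() takes a character vector in which each element becomes one document. The citiesText below is a hypothetical stand-in, since the original object is not shown, and the file-style names are invented for illustration.

    ```r
    library("quanteda")

    # Hypothetical stand-in for citiesText: one string per PDF, named by file
    citiesText <- c(
      cityA.pdf = "Text extracted from the first PDF.",
      cityB.pdf = "Text extracted from the second PDF."
    )

    # corpus() needs a character vector; each element becomes one document
    stopifnot(is.character(citiesText))

    citiesCorpus <- corpus(citiesText)
    ndoc(citiesCorpus)      # number of documents
    docnames(citiesCorpus)  # taken from the names of the character vector
    ```

    If your citiesText turns out to be a list, or holds more than one element per file, it would need to be reduced to one string per file before this call.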
    

    Converting a VCorpus works fine for me.

    library("quanteda")
    ## Package version: 2.0.1
    
    library("tm")
    packageVersion("tm")
    ## [1] ‘0.7.7’
    
    reut21578 <- system.file("texts", "crude", package = "tm")
    VCorp <- VCorpus(
      DirSource(reut21578, mode = "binary"),
      list(reader = readReut21578XMLasPlain)
    )
    
    corpus(VCorp)
    ## Corpus consisting of 20 documents and 16 docvars.
    ## text1 :
    ## "Diamond Shamrock Corp said that effective today it had cut i..."
    ## 
    ## text2 :
    ## "OPEC may be forced to meet before a scheduled June session t..."
    ## 
    ## text3 :
    ## "Texaco Canada said it lowered the contract price it will pay..."
    ## 
    ## text4 :
    ## "Marathon Petroleum Co said it reduced the contract price it ..."
    ## 
    ## text5 :
    ## "Houston Oil Trust said that independent petroleum engineers ..."
    ## 
    ## text6 :
    ## "Kuwait"s Oil Minister, in remarks published today, said ther..."
    ## 
    ## [ reached max_ndoc ... 14 more documents ]
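
    If the conversion still fails on your own data, one thing worth checking is how many character elements each document in the VCorpus holds. The sketch below uses two throwaway text files (hypothetical data, not your 91 PDFs). It assumes, without confirmation from the question, that the 8714 in the error message could be a total element count across documents whose content is a multi-element character vector.

    ```r
    library("tm")

    # Toy data: two small text files in a temporary directory
    docDir <- tempfile("docs")
    dir.create(docDir)
    writeLines(c("Line one.", "Line two."), file.path(docDir, "a.txt"))
    writeLines("A single line.", file.path(docDir, "b.txt"))

    # The default plain-text reader stores one character element per line
    Vcorp <- VCorpus(DirSource(docDir))

    # Count the character elements held by each document
    countElems <- function(corp) {
      vapply(seq_len(length(corp)),
             function(i) length(content(corp[[i]])),
             integer(1))
    }
    elemsPerDoc <- countElems(Vcorp)
    elemsPerDoc
    sum(elemsPerDoc) == length(Vcorp)  # FALSE means some documents hold several elements

    # Collapse every document to a single string
    VcorpFlat <- tm_map(Vcorp, content_transformer(function(x) paste(x, collapse = "\n")))
    countElems(VcorpFlat)
    ```

    Whether collapsing the documents removes the error depends on the real cause, so treat this as a diagnostic step rather than a confirmed fix.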