r nlp quanteda

How to get basic readability statistics using quanteda in R


To get some very basic insights into a couple of hundred PDFs, I want to calculate the readability score (Flesch-Kincaid) of each PDF and present the results in a spreadsheet. My R skills are inadequate and I can't find the solution myself. I'm looking for a very basic solution. This is what I have so far:

directory <- "my_folder"
my_corpus <- VCorpus(DirSource(directory, pattern = ".pdf"),
                     readerControl = list(reader = readPDF, language = "dutch"))

However, when I pass this to quanteda, I get the error message 'row names supplied are of the wrong length' from the following call:

textstat_readability(corpus(my_corpus), measure = "Flesch.Kincaid")

Is there a way to remedy this, or does an alternative exist?


Solution

  • Yes - avoid the tm workflow and read the PDFs with readtext instead.

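    library(quanteda)            # corpus()
    library(quanteda.textstats)  # textstat_readability() lives here in quanteda >= 3.0
    directory <- "my_folder"
    my_corpus <- readtext::readtext(paste0(directory, "/*.pdf"))
    textstat_readability(corpus(my_corpus), measure = "Flesch.Kincaid")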
    

    But keep in mind that the syllable counter that many readability measures (including Flesch-Kincaid) rely on was built for English, so it may not count syllables correctly in Dutch.
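
Since the question also asks for the scores in a spreadsheet, here is a minimal sketch of one way to get there. It assumes the readtext, quanteda and quanteda.textstats packages are installed (plus pdftools, which readtext needs for PDFs). The helper name score_documents and the output file name readability.csv are made up for illustration; neither comes from the question.

```r
library(readtext)            # reads PDFs into a data frame (doc_id, text)
library(quanteda)            # corpus()
library(quanteda.textstats)  # textstat_readability() in quanteda >= 3.0

# Hypothetical helper: takes a data frame with doc_id and text columns
# (the shape readtext() returns) and gives back one row per document
# with its Flesch-Kincaid score.
score_documents <- function(docs) {
  textstat_readability(corpus(docs), measure = "Flesch.Kincaid")
}

directory <- "my_folder"        # folder name from the question
out_file  <- "readability.csv"  # assumed output name, change as needed

# Only run the pipeline if the folder is actually there.
if (dir.exists(directory)) {
  docs   <- readtext(file.path(directory, "*.pdf"))
  scores <- score_documents(docs)
  # Plain CSV, which Excel, LibreOffice and similar tools can open.
  write.csv(scores, out_file, row.names = FALSE)
}
```

The resulting file should have a document column and a Flesch.Kincaid column. Given the caveat about Dutch syllable counts, the numbers are probably more useful for comparing the PDFs with each other than as absolute grade levels.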