Tags: r, pdf, text-mining, corpus, tidytext

From pdf text to tidy dataframe with file names in document column


I want to analyse the text of almost 300 pdf documents. I used the pdftools, tm and tidytext packages to read the text, converted it to a corpus and then to a document-term matrix, and finally I want to structure it as a tidy dataframe.

I've got a couple of questions:

  • How do I get rid of the page data (headers and footers at the top and/or bottom of every pdf page)?
  • I would rather have the filenames as the values in the document column instead of index numbers.
  • The following code contains only 2 pdf files for reproducibility. When I run all my files I get 294 documents in my corpus object, but when I tidy it I seem to lose some files, because converted %>% distinct(document) returns only 275. I wonder why that is.

I've got the following reproducible script:

library(tidyverse)
library(tidytext)
library(pdftools)
library(tm)
library(broom)

# Create a temporary empty directory 
# (don't worry at the end of this script I'll remove this directory and its files)

dir.create("~/Desktop/sample-pdfs")

# Fill directory with 2 pdf files from my github repo

download.file("https://github.com/thomasdebeus/colourful-facts/raw/master/projects/sample-data/'s-Gravenhage_coalitieakkoord.pdf", destfile = "~/Desktop/sample-pdfs/'s-Gravenhage_coalitieakkoord.pdf")
download.file("https://github.com/thomasdebeus/colourful-facts/raw/master/projects/sample-data/Aa%20en%20Hunze_coalitieakkoord.pdf", destfile = "~/Desktop/sample-pdfs/Aa en Hunze_coalitieakkoord.pdf")

# Create vector of file paths

dir <- "~/Desktop/sample-pdfs"
pdfs <- list.files(dir, pattern = "\\.pdf$", full.names = TRUE)  # pattern is a regex, not a glob

# Read the text from pdf's with pdftools package

pdfs_text <- map(pdfs, pdf_text)

# Convert to document-term-matrix

converted <- Corpus(VectorSource(pdfs_text)) %>%
          DocumentTermMatrix()

# Now I want to convert this to a tidy format

converted %>%
          tidy() %>%
          filter(!grepl("[0-9]+", term))

With the following output:

# A tibble: 5,305 x 3
   document term           count
   <chr>    <chr>          <dbl>
 1 1        aan              158
 2 1        aanbesteding       2
 3 1        aanbestedingen     1
 4 1        aanbevelingen      1
 5 1        aanbieden          3
 6 1        aanbieders         1
 7 1        aanbod             8
 8 1        aandacht          16
 9 1        aandachtspunt      3
10 1        aandeel            1
# ... with 5,295 more rows

This seems to work nicely, but I would rather have the filenames ("'s-Gravenhage" and "Aa en Hunze") as the values in the document column instead of index numbers. How do I do this?

Desired output:

# A tibble: 5,305 x 3
   document      term           count
   <chr>         <chr>          <dbl>
 1 's-Gravenhage aan              158
 2 's-Gravenhage aanbesteding       2
 3 's-Gravenhage aanbestedingen     1
 4 's-Gravenhage aanbevelingen      1
 5 's-Gravenhage aanbieden          3
 6 's-Gravenhage aanbieders         1
 7 's-Gravenhage aanbod             8
 8 's-Gravenhage aandacht          16
 9 's-Gravenhage aandachtspunt      3
10 's-Gravenhage aandeel            1
# ... with 5,295 more rows

Delete the downloaded files and their directory from the desktop by running the following line:

unlink("~/Desktop/sample-pdfs", recursive = TRUE)

All help is much appreciated! 💐


Solution

  • You can read the documents straight into a corpus with tm; its reader readPDF uses pdftools as the engine. There is no need to first extract the text and then push it through a corpus to get your output. I created 2 examples. The first is in line with what you were doing, but reads the files directly into a corpus. The second is based purely on tidyverse + tidytext, so there is no need to switch between tm, tidytext etc.

    The difference in the number of tokens between the examples is due to the automatic cleaning done by tidytext and its tokenizer.

    If you have a lot of documents to process, you might want to use quanteda as your workhorse, as it can work on multiple cores out of the box and might speed up the tokenizing part. Don't forget to use the stopwords package to get a good list of Dutch stopwords. If you need POS tagging for Dutch words, check the udpipe package.
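
As a minimal sketch of the stopwords suggestion: the token table below is made-up toy data standing in for the tidied output, and the names tokens, dutch_stopwords and tokens_clean are only illustrative. It assumes the default "snowball" source of the stopwords package, which includes a Dutch list.

```r
library(dplyr)
library(tibble)
library(stopwords)

# Toy token counts standing in for the tidied output (made-up numbers)
tokens <- tibble(
  document = "example.pdf",
  word     = c("de", "het", "een", "onderwijs", "aanbesteding"),
  count    = c(40L, 25L, 30L, 5L, 2L)
)

# Dutch stop words from the default "snowball" source, as a one-column tibble
dutch_stopwords <- tibble(word = stopwords::stopwords("nl"))

# Keep only the rows whose word is not in the stop word list
tokens_clean <- tokens %>%
  anti_join(dutch_stopwords, by = "word")

tokens_clean
```

The same anti_join() step could be appended to either pipeline in this answer, after renaming the term column to word where needed.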

    library(tidyverse)
    library(tidytext)
    library(tm)
    
    directory <- "D:/sample-pdfs"
    
    # create corpus from pdfs
    converted <- VCorpus(DirSource(directory), readerControl = list(reader = readPDF)) %>% 
      DocumentTermMatrix()
    
    
    converted %>%
      tidy() %>%
      filter(!grepl("[0-9]+", term))
    
    # A tibble: 5,707 x 3
       document                          term           count
       <chr>                             <chr>          <dbl>
     1 's-Gravenhage_coalitieakkoord.pdf "\ade"             4
     2 's-Gravenhage_coalitieakkoord.pdf "\adeze"           1
     3 's-Gravenhage_coalitieakkoord.pdf "\aeen"            2
     4 's-Gravenhage_coalitieakkoord.pdf "\aer"             2
     5 's-Gravenhage_coalitieakkoord.pdf "\aextra"          2
     6 's-Gravenhage_coalitieakkoord.pdf "\agroei"          1
     7 's-Gravenhage_coalitieakkoord.pdf "\ahet"            1
     8 's-Gravenhage_coalitieakkoord.pdf "\amet"            1
     9 's-Gravenhage_coalitieakkoord.pdf "\aonderwijs,"     1
    10 's-Gravenhage_coalitieakkoord.pdf "\aop"            11
    # ... with 5,697 more rows
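
The document column now holds the full file names. To get closer to the desired output in the question ("'s-Gravenhage", "Aa en Hunze"), the shared suffix can be stripped afterwards. This is a sketch on a toy tibble with made-up counts; it assumes every file name ends in "_coalitieakkoord.pdf", as the two sample files do, and tidy_terms is just an illustrative name.

```r
library(dplyr)
library(stringr)
library(tibble)

# Toy stand-in for the tidied document-term matrix (made-up counts)
tidy_terms <- tibble(
  document = c("'s-Gravenhage_coalitieakkoord.pdf",
               "Aa en Hunze_coalitieakkoord.pdf"),
  term     = c("aan", "aan"),
  count    = c(158, 12)
)

# Remove the common "_coalitieakkoord.pdf" suffix from the end of each file name
tidy_terms <- tidy_terms %>%
  mutate(document = str_remove(document, "_coalitieakkoord\\.pdf$"))

tidy_terms$document
```

The same mutate() call works on the output of the tidytext-only approach, since its document column contains the same file names.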
    

    Using just tidytext, without tm:

    directory <- "D:/sample-pdfs"
    
    # pattern is a regular expression, so match a literal ".pdf" at the end of the name
    pdf_names <- list.files(directory, pattern = "\\.pdf$")
    pdfs <- file.path(directory, pdf_names)
    pdfs_text <- map(pdfs, pdftools::pdf_text)
    
    
    # tibble() replaces the deprecated data_frame()
    my_data <- tibble(document = pdf_names, text = pdfs_text)
    
    my_data %>% 
      unnest(text) %>% # text is a list column: one character vector of pages per pdf
      unnest_tokens(word, text, strip_numeric = TRUE) %>%  # removing all numbers
      group_by(document, word) %>% 
      summarise(count = n())
    # A tibble: 4,646 x 3
    # Groups:   document [?]
       document                          word                    count
       <chr>                             <chr>                   <int>
     1 's-Gravenhage_coalitieakkoord.pdf 1e                          2
     2 's-Gravenhage_coalitieakkoord.pdf 2e                          2
     3 's-Gravenhage_coalitieakkoord.pdf 3e                          1
     4 's-Gravenhage_coalitieakkoord.pdf 4e                          1
     5 's-Gravenhage_coalitieakkoord.pdf aan                       164
     6 's-Gravenhage_coalitieakkoord.pdf aanbesteding                2
     7 's-Gravenhage_coalitieakkoord.pdf aanbestedingen              1
     8 's-Gravenhage_coalitieakkoord.pdf aanbestedingsprocedures     1
     9 's-Gravenhage_coalitieakkoord.pdf aanbevelingen               1
    10 's-Gravenhage_coalitieakkoord.pdf aanbieden                   4
    # ... with 4,636 more rows
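
To illustrate the automatic cleaning mentioned earlier, here is a small sketch on a made-up sentence (raw and tokens are illustrative names). With its default settings unnest_tokens() lowercases the text and strips punctuation, so the three spellings below collapse into one word, whereas the tm output above still contains terms with attached punctuation such as "onderwijs,".

```r
library(dplyr)
library(tibble)
library(tidytext)

# One made-up "document" with mixed case and punctuation
raw <- tibble(
  document = "example.pdf",
  text     = "Onderwijs, onderwijs en ONDERWIJS."
)

# Tokenize into words (lowercased, punctuation stripped) and count per document
tokens <- raw %>%
  unnest_tokens(word, text) %>%
  count(document, word, name = "count")

tokens
```

Here count() is used as a shorthand for the group_by() and summarise(count = n()) steps in the example above.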