I want to analyse text from almost 300 PDF documents. I used the pdftools, tm and tidytext packages to read the text, converted it to a corpus, then to a document-term matrix, and I finally want to structure it in a tidy data frame.
I've got a couple of questions:
1. How do I get the file names as the values in the document column instead of indexed numbers?
2. All the files seem to be in the corpus object, but when I tidy it I seem to lose some files, because converted %>% distinct(document) gives 275 back. I wonder why that is.
I've got the following reproducible script:
library(tidyverse)
library(tidytext)
library(pdftools)
library(tm)
library(broom)
# Create a temporary empty directory
# (don't worry at the end of this script I'll remove this directory and its files)
dir.create("~/Desktop/sample-pdfs")
# Fill directory with 2 pdf files from my github repo
download.file("https://github.com/thomasdebeus/colourful-facts/raw/master/projects/sample-data/'s-Gravenhage_coalitieakkoord.pdf", destfile = "~/Desktop/sample-pdfs/'s-Gravenhage_coalitieakkoord.pdf")
download.file("https://github.com/thomasdebeus/colourful-facts/raw/master/projects/sample-data/Aa%20en%20Hunze_coalitieakkoord.pdf", destfile = "~/Desktop/sample-pdfs/Aa en Hunze_coalitieakkoord.pdf")
# Create vector of file paths
dir <- "~/Desktop/sample-pdfs"
# (pattern is a regular expression, so anchor it on the .pdf extension)
pdfs <- file.path(dir, list.files(dir, pattern = "\\.pdf$"))
# Read the text from pdf's with pdftools package
pdfs_text <- map(pdfs, pdf_text)
# Convert to document-term-matrix
converted <- Corpus(VectorSource(pdfs_text)) %>%
  DocumentTermMatrix()
# Now I want to convert this to a tidy format
converted %>%
  tidy() %>%
  filter(!grepl("[0-9]+", term))
With the following output:
# A tibble: 5,305 x 3
document term count
<chr> <chr> <dbl>
1 1 aan 158
2 1 aanbesteding 2
3 1 aanbestedingen 1
4 1 aanbevelingen 1
5 1 aanbieden 3
6 1 aanbieders 1
7 1 aanbod 8
8 1 aandacht 16
9 1 aandachtspunt 3
10 1 aandeel 1
# ... with 5,295 more rows
This seems to work out nicely, but I would rather have the file names ("'s-Gravenhage" and "Aa en Hunze") as the values in the document column instead of indexed numbers. How do I do this?
Desired output:
# A tibble: 5,305 x 3
document term count
<chr> <chr> <dbl>
1 's-Gravenhage aan 158
2 's-Gravenhage aanbesteding 2
3 's-Gravenhage aanbestedingen 1
4 's-Gravenhage aanbevelingen 1
5 's-Gravenhage aanbieden 3
6 's-Gravenhage aanbieders 1
7 's-Gravenhage aanbod 8
8 's-Gravenhage aandacht 16
9 's-Gravenhage aandachtspunt 3
10 's-Gravenhage aandeel 1
# ... with 5,295 more rows
Delete the downloaded files and their directory from the desktop by running the following line:
unlink("~/Desktop/sample-pdfs", recursive = TRUE)
All help is much appreciated! 💐
You can read the documents straight into a corpus with tm. The readPDF reader uses pdftools as its engine, so there is no need to first extract the text yourself and then push it through a corpus to get your output. I created two examples. The first is in line with what you were doing, but goes through a corpus first. The second is based purely on tidyverse + tidytext, with no need to switch between tm, tidytext, etc.
The difference in the number of tokens between the examples is due to the automatic cleaning done by tidytext / the tokenizer: unnest_tokens strips punctuation, for instance, whereas the tm output still contains terms such as "onderwijs,".
If you have a lot of documents to process, you might want to use quanteda as your workhorse, as it can work on multiple cores out of the box and might speed up the tokenizing step. Don't forget to use the stopwords package to get a good list of Dutch stopwords. If you need POS tagging for Dutch words, check the udpipe package.
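As a rough sketch of the stopword tip, the snippet below removes Dutch stopwords from tidy token counts with an anti-join. The tokens tibble is made-up toy data standing in for the tidy output of the examples, and the names tokens, dutch_stopwords and tokens_clean are only illustrative. It assumes the stopwords package is installed and uses its default (Snowball) Dutch list.

```r
library(dplyr)
library(tibble)

# Toy token counts standing in for the tidy output (hypothetical data)
tokens <- tibble(
  document = "example.pdf",
  word     = c("de", "gemeente", "het", "onderwijs"),
  count    = c(40L, 12L, 25L, 7L)
)

# Dutch stopword list from the stopwords package (default source: Snowball)
dutch_stopwords <- tibble(word = stopwords::stopwords("nl"))

# Keep only the tokens that do not appear in the stopword list
tokens_clean <- anti_join(tokens, dutch_stopwords, by = "word")
tokens_clean
```

The same anti_join can be applied to the output of either example, as long as the token column is called word (rename term first for the tm-based output).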
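Going through a tm corpus first
library(tidyverse)
library(tidytext)
library(tm)
directory <- "D:/sample-pdfs"
# create corpus from pdfs; DirSource uses the file names as document ids
converted <- VCorpus(DirSource(directory), readerControl = list(reader = readPDF)) %>%
  DocumentTermMatrix()
# tidy the document-term matrix and drop terms that contain digits
converted %>%
  tidy() %>%
  filter(!grepl("[0-9]+", term))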
# A tibble: 5,707 x 3
document term count
<chr> <chr> <dbl>
1 's-Gravenhage_coalitieakkoord.pdf "\ade" 4
2 's-Gravenhage_coalitieakkoord.pdf "\adeze" 1
3 's-Gravenhage_coalitieakkoord.pdf "\aeen" 2
4 's-Gravenhage_coalitieakkoord.pdf "\aer" 2
5 's-Gravenhage_coalitieakkoord.pdf "\aextra" 2
6 's-Gravenhage_coalitieakkoord.pdf "\agroei" 1
7 's-Gravenhage_coalitieakkoord.pdf "\ahet" 1
8 's-Gravenhage_coalitieakkoord.pdf "\amet" 1
9 's-Gravenhage_coalitieakkoord.pdf "\aonderwijs," 1
10 's-Gravenhage_coalitieakkoord.pdf "\aop" 11
# ... with 5,697 more rows
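On the second question (only 275 documents left after tidying): I can't tell for sure without the files, but one possible cause is that tidy() on a document-term matrix only returns the non-zero entries, so a document without any extracted terms (for example a scanned, image-only PDF) disappears from the tidy result. A minimal sketch to check this, using a made-up toy corpus instead of your PDFs (texts, all_docs, tidy_docs and missing_docs are illustrative names):

```r
library(tm)
library(tidytext)
library(dplyr)

# Toy corpus: the second "document" has no extractable text (hypothetical data)
texts <- c("aan de slag met onderwijs", "", "groei en onderwijs")
dtm <- DocumentTermMatrix(VCorpus(VectorSource(texts)))

# Every document that is a row in the document-term matrix
all_docs <- Docs(dtm)

# Documents that survive tidying (only non-zero counts are kept)
tidy_docs <- tidy(dtm) %>%
  distinct(document) %>%
  pull(document)

# Documents that are in the matrix but have no terms at all
missing_docs <- setdiff(all_docs, tidy_docs)
missing_docs
```

Running the same setdiff on your own converted object should list the files that dropped out, which you can then open by hand to see whether they contain any selectable text.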
Just using tidytext and not tm
directory <- "D:/sample-pdfs"
# pattern is a regular expression, so anchor it on the .pdf extension
pdf_names <- list.files(directory, pattern = "\\.pdf$")
pdfs <- file.path(directory, pdf_names)
# one character vector per file, with one element per page
pdfs_text <- map(pdfs, pdftools::pdf_text)
my_data <- tibble(document = pdf_names, text = pdfs_text)
my_data %>%
  unnest(text) %>% # text is a list column; this gives one row per page
  unnest_tokens(word, text, strip_numeric = TRUE) %>% # drops purely numeric tokens ("1e" stays)
  group_by(document, word) %>%
  summarise(count = n())
# A tibble: 4,646 x 3
# Groups: document [?]
document word count
<chr> <chr> <int>
1 's-Gravenhage_coalitieakkoord.pdf 1e 2
2 's-Gravenhage_coalitieakkoord.pdf 2e 2
3 's-Gravenhage_coalitieakkoord.pdf 3e 1
4 's-Gravenhage_coalitieakkoord.pdf 4e 1
5 's-Gravenhage_coalitieakkoord.pdf aan 164
6 's-Gravenhage_coalitieakkoord.pdf aanbesteding 2
7 's-Gravenhage_coalitieakkoord.pdf aanbestedingen 1
8 's-Gravenhage_coalitieakkoord.pdf aanbestedingsprocedures 1
9 's-Gravenhage_coalitieakkoord.pdf aanbevelingen 1
10 's-Gravenhage_coalitieakkoord.pdf aanbieden 4
# ... with 4,636 more rows
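Both examples give the full file name in the document column. To get to the desired output ("'s-Gravenhage", "Aa en Hunze"), you could strip the shared suffix afterwards. A small sketch, assuming every file ends in _coalitieakkoord.pdf like the two sample files; the counts tibble is toy data (the count for the second file is made up), and counts_named is just an illustrative name:

```r
library(dplyr)
library(stringr)
library(tibble)

# Toy version of the tidy output (second count is hypothetical)
counts <- tibble(
  document = c("'s-Gravenhage_coalitieakkoord.pdf",
               "Aa en Hunze_coalitieakkoord.pdf"),
  word     = c("aan", "aan"),
  count    = c(164L, 10L)
)

# Remove the shared suffix so only the municipality name remains
counts_named <- counts %>%
  mutate(document = str_remove(document, "_coalitieakkoord\\.pdf$"))
counts_named
```

If the files do not all share that suffix, tools::file_path_sans_ext() at least removes the .pdf extension.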