So I'm doing a project where I need to load a numerous amount of .pdfs into R. This part is somewhat covered. The problem is when importing the pdfs into R, every line is a string. Not all the information in de the string is relevant. And in some of the cases information is missing. So I want to select the info I need and place them into a tibble for further analysis.
Importing the pdf's are done by pdftools. It's working, hints or tips are welcome though
invoice_pdfs = list.files(pattern="*.pdf") # gather all the .pdf in current wd.
invoice_list <- map(invoice_pdfs, .f = function(invoices){ # Using the purrr::map function .
pdf_text(invoices) %>% # extracting text from listed pdf file(s)
readr::read_lines() %>% # read all text from pdf
str_squish() %>% # clear all white space in text.
str_to_lower # convert string to lower case
})
reproducible example:
invoice_example <- c("invoice",
"to: rade ris",
"cane nompany",
"kakber street 23d",
"nork wey",
"+223 (0)56 015 6542",
"invoice id: 85600023",
"date reference product product reference weigth amount",
"01-02-2016 840000023 product a 24.45.6 de6583621 14.900 kg a 50 per tonne 745,00",
"07-02-2016 840000048 product b 24.45.7 qf8463641 19.000 kg a 50 per tonne 950,00",
"03-02-2016 840000032 product b 24.34.2 qf8463641 4.000 kg per tonne 250,00",
"02-02-2016 840000027 ke7801465 1.780 kg per tonne 89,00",
"subtotal 2.034,00",
"sales tax 183,06",
"total 2.217,06")
So here is where the problem starts. What I've tried is using stringr and rebus to select specific parts of the text. I've made the following function to search the document for specific string, it returns the rownumber:
word_finder <- function(x, findWord){
word_hit <- x %>% # temp for storing TRUE or FALSE
str_detect(pattern = fixed(findWord))
which(word_hit == TRUE) # give rownumber if TRUE
}
And the following searchpatterns:
detect_date <- dgt(2) %R% "-" %R% dgt(2) %R% "-" %R% dgt(2)
detect_money <- optional(DIGIT) %R% optional(".") %R% one_or_more(DIGIT) %R% "," %R% dgt(2)
detect_invoice_num <- str_trim(SPC %R% dgt(8) %R% optional(SPC))
The next step should be to make a tibble (or data frame) with the column names c("date", "reference", "product", "product reference", "weight", "amount")
I've also tried making a tibble of the whole invoice_example
problem is the missing info in some fields and the column names don’t match the corresponding value's.
So I would like to make some function that uses the search pattern and places that specific value to a predestined column. I've got no clue how to get this done. Or maybe I should handle this completely different?
final result should be something like this.
reproducible example:
invoice_nr <- c("85600023", "85600023", "85600023", "85600023" )
date <- c( "01-02-2016", "07-02-2016", "03-02-2016", "02-02-2016")
reference <- c( "840000023", "840000048", "840000032", "840000027")
product_id <- c( "de6583621", "qf8463641", "qf8463641", "ke7801465")
weight <- c("14.900", "19.000", "4.000", "1.780")
amount <- c("745.00", "950.00", "250.00", "89.00")
example_tibble <- tibble(invoice_nr, date, reference, product_id, weight, amount)
Result:
# A tibble: 4 x 6
invoice_nr date reference product_id weight amount
<chr> <chr> <chr> <chr> <chr> <chr>
1 85600023 01-02-2016 840000023 de6583621 14.900 745.00
2 85600023 07-02-2016 840000048 qf8463641 19.000 950.00
3 85600023 03-02-2016 840000032 qf8463641 4.000 250.00
4 85600023 02-02-2016 840000027 ke7801465 1.780 89.00
Any suggested ways of dealing with this will be appreciated!
Actually you can use the functions of library(stringr)
to achieve your goal (I skipped the rebus
part as this seems to eb anyways 'just' a helper for creatign teh regex, which I did by hand):
library(tidyverse)
parse_invoice <- function(in_text) {
## define regex, some assumptions:
## product id is 2 lower characters followed by 7 digits
## weight is some digits with a dot followed by kg
## amount is some digits at the end with a comma
all_regex <- list(date = "\\d{2}-\\d{2}-\\d{4}",
reference = "\\d{9}",
product_id = "[a-z]{2}\\d{7}",
weight = "\\d+\\.\\d+ kg",
amount = "\\d+,\\d+$")
## look only at lines where there is invoice data
rel_lines <- str_subset(in_text, all_regex$date)
## extract the pieces from the regex
ret <- as_tibble(map(all_regex, str_extract, string = rel_lines))
## clean up the data
ret %>%
mutate(invoice_nr = str_extract(str_subset(in_text, "invoice id:"), "\\d{8}"),
date = as.Date(date, "%d-%m-%Y"),
weight = as.numeric(str_replace(weight, "(\\d+.\\d+) kg", "\\1")),
amount = as.numeric(str_replace(amount, ",", "."))
) %>%
select(invoice_nr,
date,
reference,
product_id,
weight,
amount)
}
str(parse_invoice(invoice_example)) # Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 4 obs. of 6 variables: # $ invoice_nr: chr "85600023" "85600023" "85600023" "85600023" # $ date : Date, format: "2016-02-01" "2016-02-07" ... # $ reference : chr "840000023" "840000048" "840000032" "840000027" # $ product_id: chr "de6583621" "qf8463641" "qf8463641" "ke7801465" # $ weight : num 14.9 19 4 1.78 # $ amount : num 745 950 250 89