Tags: r, pdf, data-mining

R Language - Extracting the correct Data Type in a PDF Table


Please consider the following table found in a PDF file:

[Table with Models and Quantities]

I can download and extract the table with the following code:

url <- "https://www.fenabrave.org.br/portal/files/2023_01_2.pdf"

download.file(url, 'cars.pdf', mode="wb")

library(tabulapdf)
library(dplyr)   # bind_rows(), mutate()
library(purrr)   # set_names()

df <- extract_tables(
  'cars.pdf',
  pages = 27,
  area = list(c(126.4826, 96.5997, 782.1684, 297.9600)),
  guess = FALSE) |>
  bind_rows() |>
  set_names(c("Model", "Quantity"))

Unfortunately, the function reads the quantities as doubles: the Brazilian thousands separator "." is interpreted as a decimal point, so the trailing zeros are dropped (e.g. 2.830 is read as 2.83).
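For reference, base R's `as.numeric()` shows the same behaviour, and if the values are available as character, `readr::parse_number()` with a Brazilian-style locale parses them correctly. This is a minimal sketch; the use of readr here is an assumption, not part of the original code:

```r
library(readr)

# In pt-BR number formatting, "." is a grouping (thousands) mark,
# not a decimal point
br <- locale(decimal_mark = ",", grouping_mark = ".")

as.numeric("2.830")                 # default parsing drops the zero: 2.83
parse_number("2.830", locale = br)  # locale-aware parsing: 2830
```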

I can add the following code to change its class:

df <- extract_tables(
  'cars.pdf',
  pages = 27,
  area = list(c(126.4826, 96.5997, 782.1684, 297.9600)),
  guess = FALSE) |>
  bind_rows() |>
  set_names(c("Model", "Quantity")) |>
  mutate(Quantity = gsub("\\.", "", Quantity)) |>
  mutate(Quantity = as.integer(Quantity))

But the damage is already done: 2.830 became 283, and 1.220 became 122.

Is there a way to make the data be read as character?


Solution

  • extract_tables lets you read the data in as a character vector. That string can then be parsed with fread using colClasses = 'character', which reads every column as character. Finally, gsub("\\.", "", df2$quantity) removes the thousands dot.

    library(tabulapdf)
    
    string <- tabulapdf::extract_tables("table.pdf", output = "character") |>
      unlist()
    
    library(data.table)
    
    df2 <- fread(string, colClasses = 'character', data.table = FALSE)
    
    df2$quantity <- gsub("\\.","",df2$quantity)
    

    giving

      Head1 quantity
    1     1     3678
    2     2     3093
    3     3     2830
    4     4     2770
    5     5     2200
    
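To finish, the cleaned character column can be cast to integer without losing any digits. A minimal sketch on sample data shaped like the extracted column (the data frame here is a stand-in, not the one produced from the PDF):

```r
# stand-in for the data frame extracted from the PDF
df2 <- data.frame(Head1 = c("1", "2", "3"),
                  quantity = c("3.678", "3.093", "2.830"))

# strip the grouping dots, then cast to integer
df2$quantity <- as.integer(gsub("\\.", "", df2$quantity))
df2$quantity
# 3678 3093 2830
```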

    Test PDF
