Search code examples
rweb-scrapingtidyversepdf-scrapingtabulizer

Tabulizer package in R: how to scrape tables after specific Title


How to scrape tables preceded with some title text from PDF? I am experimenting with tabulizer package. Here an example of getting a table from a specific page (Polish "Map of Public Health Needs")

library(tabulizer)
library(tidyverse)
options(java.parameters = "-Xmx8000m")

location<-"http://www.mpz.mz.gov.pl/wp-content/uploads/sites/4/2019/01/mpz_choroby_ukladu_kostno_miesniowego_woj_dolnoslaskie.pdf"

(out<-extract_tables(location, pages = 8,encoding = "UTF-8", method = "stream", outdir = getwd())[[4]] %>%
as.tibble())

This gets me one table at specific page. But I will have plenty of such pdfs to scrape, from the site: http://www.mpz.mz.gov.pl/mapy-dla-30-grup-chorob-2018/ and then subpages with many links for each illness, getting the links with rvest, for each province of Poland and I need to scrape tables after a specific title string eg.

Tabela 1.2.2: Struktura zapadalnosci rejestrowanej w zależności od płci, miejsca zamieszkania oraz grupy wiekowej - Choroby układowe tkanki łącznej"

I need to detect Tabela(...) Struktura zapadalnosci(...)", because the tables may not be at the same page. Many thanks for any directions and ideas in advance.

EDIT: After I asked the question I succeeded so far to find pages where the table might be, maybe very ineffective:

library(pdfsearch)

pages <-
  keyword_search(
    location,
    keyword = c(
      'Tabela',
      'Struktura zapadalnosci rejestrowanej'
    ),
    path = TRUE,
    surround_lines = FALSE
  ) %>%
  group_by(page_num) %>%
  mutate(keyword = paste0(keyword, collapse = ";")) %>%
  filter(
    str_detect(keyword, "Tabela") &
      str_detect(keyword, "Struktura zapadalnosci rejestrowanej")
  ) %>%
  pull(page_num) %>%
  unique()

Solution

  • I can help you with your basic problem, but there is one catch (see at the end). I use pdftools instead of pdfsearch but it's basically doing the same in this case (finding pages with a table). In order to save time, I only download the PDF once at the start:

    options(java.parameters = "-Xmx8000m")# needs to be set before loading tabulizer
    library(tabulizer)
    library(tidyverse)
    
    location <- "http://www.mpz.mz.gov.pl/wp-content/uploads/sites/4/2019/01/mpz_choroby_ukladu_kostno_miesniowego_woj_dolnoslaskie.pdf"
    download.file(location, "test.pdf", mode = "wb")
    

    Now convert the pdf to a data.frame with each line in a row of the df:

    raw <- pdftools::pdf_data("test.pdf") 
    pages <- lapply(seq_along(raw), function(p) {
      if (nrow(raw[[p]]) > 0) {
        raw[[p]]$page <- p
        raw[[p]]
      }
    }) %>% 
      bind_rows() %>% 
      group_by(y, page) %>% 
      summarise(text = paste(text, collapse = " ")) %>% 
      arrange(page, y)
    

    This data.frame is searchable and we keep only the lines fitting your keyword:

    tables <- pages %>% 
      filter(grepl("Tabela .* Struktura zapadalnosci", text))
    

    There are 8 lines which fit the keyphrase. We only extract tables from these. Furthermore, the function within the lapply loop only keeps the matrix with the most rows. If there are two tables on one page that might be a problem but generally it works well to use only the 'best guess' that tabulizer made finding the table structure.

    tables_list <- lapply(tables$page, function(p) {
      cat(p, "\n")
      out <- extract_tables("test.pdf", 
                            pages = p,
                            encoding = "UTF-8", 
                            method = "stream", 
                            output = "matrix")
      out <- as_tibble(out[[which.max(sapply(out, nrow) + sapply(out, ncol))]]) # keep the biggest table
      attr(out, "caption") <- tables$text[tables$page %in% p]
      return(out)
    })
    

    The object tables_list now contains a list of data.frames, each a converted table:

    > tables_list[[1]]
    # A tibble: 16 x 8
       V1                  V2    V3    V4    V5    V6    V7    V8   
       <chr>               <chr> <chr> <chr> <chr> <chr> <chr> <chr>
     1 dolnośląskie        77,05 74,65 4,04  10,59 13,37 27,87 44,14
     2 kujawsko-pomorskie  78,12 65,93 4,29  14,96 14,82 27,01 38,92
     3 lubelskie           76,50 56,83 2,67  14,83 17,00 29,00 36,50
     4 lubuskie            79,10 76,23 4,92  12,70 12,70 30,74 38,93
     5 łódzkie             74,37 67,77 6,45  13,84 15,09 30,03 34,59
     6 małopolskie         72,71 55,35 6,99  14,63 12,01 25,87 40,50
     7 mazowieckie         76,31 68,52 5,89  12,11 12,30 27,03 42,67
     8 opolskie            79,55 54,65 4,83  10,04 17,47 26,02 41,64
     9 podkarpackie        75,10 47,32 7,57  14,86 18,29 25,31 33,98
    10 podlaskie           74,18 68,00 5,82  10,55 17,09 32,36 34,18
    11 pomorskie           76,57 74,96 5,71  12,74 13,76 26,65 41,14
    12 śląskie             73,51 81,15 4,89  14,96 14,43 26,64 39,08
    13 świętokrzyskie      74,45 56,51 4,91  14,00 14,74 27,27 39,07
    14 warmińsko-mazurskie 75,91 63,22 5,62  13,59 18,48 29,53 32,79
    15 wielkopolskie       72,66 62,71 3,62  14,37 14,77 29,45 37,79
    16 zachodniopomorskie  74,26 73,21 8,44  13,71 11,60 24,89 41,35
    

    I also added the (first line of the) caption of each table as an attribute to the data.frame:

    > attr(tables_list[[1]], "caption")
    [1] "Tabela 1.2.2: Struktura zapadalnosci rejestrowanej w zależności od płci, miejsca zamieszkania oraz grupy"
    

    Compare this to the pdf:

    enter image description here

    It seems this worked well, except that the column names are gone. Not sure if there is a way to retain them but that was not included in your question, so maybe you already have a solution?