Search code examples
rpdftools

R Find element of the list to extract table from pdf


I'm trying to use pdftools package to extract data table from a pdf. My source file is here: https://hypo.org/app/uploads/sites/2/2021/11/HYPOSTAT-2021_vdef.pdf. Say, I want to extract data from Table 20 on page 170 (Change in Nominal house price)

I use the following code:

install.packages("pdftools")
library(pdftools)

report <- pdftools::pdf_data("https://hypo.org/app/uploads/sites/2/2021/11/HYPOSTAT-2021_vdef.pdf")

tab20 <- as.data.frame(report[170])

To get the proper table I had to manually indicate that I want to extract 170th element of the list (as the table is on page 170). If next year, a new page with table is added to the report, I will have to modify the code to extract 171th element. Is there a way to do it in a more automated manner?

Basically, what I need to do is to find the element of the list that contains string "Change in Nominal house price". Any suggestion how to do it?


Solution

  • Another solution, based on purrr::map_lgl:

    library(tidyverse)
    library(pdftools)
    
    report <- pdftools::pdf_data("https://hypo.org/app/uploads/sites/2/2021/11/HYPOSTAT-2021_vdef.pdf")
    
    map_lgl(
      report,
      ~ str_detect(
        str_c(.x$text, collapse = " "),
        "Change in Nominal house price")) %>% report[.]
    
    #> [[1]]
    #> # A tibble: 606 × 6
    #>    width height     x     y space text       
    #>    <int>  <int> <int> <int> <lgl> <chr>      
    #>  1    59     14    39    38 TRUE  STATISTICAL
    #>  2    35     14   102    38 FALSE TABLES     
    #>  3    25     26    33    81 TRUE  20.        
    #>  4    60     26    65    81 TRUE  Change     
    #>  5    15     26   129    81 TRUE  in         
    #>  6    67     26   149    81 TRUE  Nominal    
    #>  7    47     26   221    81 TRUE  house      
    #>  8    41     26   272    81 FALSE price      
    #>  9    30     14    65   103 TRUE  Annual     
    #> 10     7     14    98   103 TRUE  %          
    #> # … with 596 more rows