r, regex, pdf, text, finance

Extracting information from PDFs that have line spills using R


I am trying to extract information from PDF files using R. The data I want are in tables, although R does not recognise them as tables. I am using the pdftools package to read in the PDF file, export it to a text file, and then re-read it line by line. A sample of the resulting lines is shown below.

I want to extract the values for "Net cash from / (used in) operating activities", but as you can see, the row spills over several lines, which makes this very hard.

library(pdftools)

pages <- pdf_text("test.pdf")   # one character string per page
write.table(pages, "out.txt")
just <- readLines("out.txt")    # re-read the exported text line by line



> just[30:40]
 [1] "          (g) insurance costs                                                     -                  (137)"
 [2] " 1.3      Dividends received (see note 3)                                         -                      -"
 [3] " 1.4      Interest received                                                       9                     21"
 [4] " 1.5      Interest and other costs of finance paid                                -                      -"
 [5] " 1.6      Income taxes paid                                                       -                      -"
 [6] " 1.7      Government grants and tax incentives                                    -                      -"
 [7] " 1.8      Other (provide details if material)                                     -                      -"
 [8] " 1.9      Net cash from / (used in) operating"                                                             
 [9] "                                                                           (1,258)                 (3,785)"
[10] "          activities"   

I want to grab the numbers (1,258) and (3,785), keeping the parentheses around them.

The numbers can end up on line 8, 9 or 10 (using my example above as a reference), so I can't simply write code that grabs the data 'next' to "Net cash from / (used in) operating activities".


Solution

  • This code almost arrives at the desired result:

    > text_file <- readLines("out.txt")
    > # Keep digits, commas and parentheses, then drop label residue such as "19()"
    > # (item number 1.9 plus the emptied "(used in)")
    > clean <- function(x) gsub("\\d+\\(\\)", "", gsub("[^()[:digit:],]+", "", x))
    > operating_line <- grep("Net cash from / \\(used in\\) operat", text_file)[1]
    > number_line1 <- clean(text_file[operating_line])      # label line
    > number_line2 <- clean(text_file[operating_line + 1])  # line below
    > number_line3 <- clean(text_file[operating_line - 1])  # line above
    > if (number_line1 != "") {
    +   numbers <- number_line1
    + } else if (number_line2 != "") {
    +   numbers <- number_line2
    + } else {
    +   numbers <- number_line3
    + }
    > numbers
    [1] "(1,258)(3,785)"
    

    However, there is no gap between (1,258) and (3,785), i.e. they are not identified as separate elements.
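
    One possible way to get the two values as separate elements is to pull each amount out with `gregexpr()` and `regmatches()` instead of deleting everything else with `gsub()`. The following is only a sketch under some assumptions: `extract_amounts` and `sample_lines` are hypothetical names, the sample lines are abridged from the question, and an amount is assumed to be a parenthesised number, a plain number, or a lone dash. A neighbouring line is accepted only if it contains nothing but amounts, so that the previous table row is not picked up by mistake.

```r
# Abridged sample lines from the question (column widths shortened)
sample_lines <- c(
  " 1.8      Other (provide details if material)        -            -",
  " 1.9      Net cash from / (used in) operating",
  "                                               (1,258)      (3,785)",
  "          activities"
)

# Assumed shape of an amount: "(1,258)", "21" or "-"
amount <- "\\([0-9][0-9,]*\\)|[0-9][0-9,]*|-"
# A string that consists of amounts and whitespace only
amounts_only <- paste0("^\\s*(", amount, ")(\\s+(", amount, "))*\\s*$")

extract_amounts <- function(lines,
                            label = "Net cash from / (used in) operating") {
  hit <- grep(label, lines, fixed = TRUE)[1]
  if (is.na(hit)) return(character(0))

  # Text to the right of the label on the label line itself
  pos  <- regexpr(label, lines[hit], fixed = TRUE)
  rest <- substring(lines[hit], pos + attr(pos, "match.length"))
  rest <- sub("^\\s*activities", "", rest)

  # Then the line below and the line above, if they exist
  neighbours <- c(hit + 1, hit - 1)
  neighbours <- neighbours[neighbours >= 1 & neighbours <= length(lines)]
  candidates <- c(rest, lines[neighbours])

  # Use the first candidate that holds nothing but amounts
  ok <- candidates[grepl(amounts_only, candidates)]
  if (length(ok) == 0) return(character(0))
  regmatches(ok[1], gregexpr(amount, ok[1]))[[1]]
}

extract_amounts(sample_lines)
# [1] "(1,258)" "(3,785)"
```

    The result is a character vector with one element per column, so the parentheses are kept and each value can be indexed on its own. If the reports use other notations for amounts (for example decimals), the `amount` pattern would need to be adjusted.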