I am trying to extract information from pdf files using R. The data I want are in tables although they arent recognised by R. I am using the pdftools to read in the pdf file, export it to a text file and then re read it in line by line. The files look like this.
I want to extract the Net cash from / (used in) operating activities but as you can see because the lines spill it makes it very hard.
pdf_text <- pdf_text("test.pdf")
write.table(pdf_text,"out.txt")
just <- readLines("input_file.txt")
> just[30:40]
[1] " (g) insurance costs - (137)"
[2] " 1.3 Dividends received (see note 3) - -"
[3] " 1.4 Interest received 9 21"
[4] " 1.5 Interest and other costs of finance paid - -"
[5] " 1.6 Income taxes paid - -"
[6] " 1.7 Government grants and tax incentives - -"
[7] " 1.8 Other (provide details if material) - -"
[8] " 1.9 Net cash from / (used in) operating"
[9] " (1,258) (3,785)"
[10] " activities"
I want to grab the numbers (1,258) and (3,785) still with the parentheses around them.
A common thing that happens is that the numbers will either be on line 8,9 or 10 (using my example above as reference) so I cant just simply write code to grab the data that is 'next' to "Net cash from / (used in) operating activities"
This code almost arrives at the desired result:
> text_file <- readLines("out.txt")
> operating_line <- grep("Net cash from / \\(used in\\) operat", text_file)
> operating_line <- operating_line[1]
> number_line1 <- text_file[operating_line]
> number_line2 <- text_file[operating_line + 1]
> number_line3 <- text_file[operating_line - 1]
> if (gsub("[^()[:digit:],]+", "", number_line1) != "") {
+ numbers <- gsub("[^()[:digit:],]+", "", number_line1)
+ } else if (gsub("[^()[:digit:],]+", "", number_line2) != "") {
+ numbers <- gsub("[^()[:digit:],]+", "", number_line2)
+ } else {
+ numbers <- gsub("[^()[:digit:],]+", "", number_line3)
+ }
> numbers <- gsub("\\d+\\(\\)", "", numbers)
> numbers
[1] "(1,258)(3,785)"
However there is no gap between the (1,258) and (3,785). i.e. they are not being identified as different elements