
Extract list based on string with tabulizer package


I am extracting the quarterly income statement from PDF reports with the tabulizer package and converting it to tabular form.

library(tabulizer)

# 2017 Q3 Report
telia_url = "http://www.teliacompany.com/globalassets/telia-company/documents/reports/2017/q3/telia-company-q3-2017-en"
telialists = extract_tables(telia_url)
# Table 22 in the list is the income statement for this report
teliatest1 = as.data.frame(telialists[[22]])

# 2009 Q3 Report
telia_url2009 = "http://www.teliacompany.com/globalassets/telia-company/documents/reports/2009/q3/teliasonera-q3-2009-report-en.pdf"
telialists2009 = extract_tables(telia_url2009)
# Table 9 in the list is the income statement for this report
teliatest2 = as.data.frame(telialists2009[[9]])

I am only interested in the Condensed Consolidated Statements of Comprehensive Income table. This title is identical or very similar in all historical reports.

Above, for the 2017 report, list element #22 was the correct table. However, since the 2009 report had a different layout, element #9 was the correct one for that report.

What would be a clever way to make this extraction dynamic, so that the table is picked based on where the string (or substring) "Condensed Consolidated Statements of Comprehensive Income" is located?

Perhaps using the tm package to find the relative position?

Thanks


Solution

  • You could use pdftools to find the page you're interested in.

    For instance a function like this one should do the job:

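    get_table <- function(url) {
      # Read the text of every page (one string per page)
      txt <- pdftools::pdf_text(url)
      # First page whose text contains the statement title
      p <- grep("condensed consolidated statements.{0,10}comprehensive income",
                txt,
                ignore.case = TRUE)[1]
      if (is.na(p)) stop("statement title not found in the PDF")
      # Extract only the tables on that page and keep the largest one
      L <- tabulizer::extract_tables(url, pages = p)
      i <- which.max(lengths(L))
      data.frame(L[[i]])
    }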
    

    The first step is to read all the pages into the character vector txt, with one element per page. Then grep finds the first page whose text matches the title you want (I inserted .{0,10} to allow up to ten arbitrary characters, such as spaces or newlines, in the middle of the title).

    Using tabulizer, you can then extract the list L of all tables located on this page only, which should be much faster than extracting every table in the document, as you did. Your table is probably the biggest on that page, hence the which.max: lengths(L) gives the number of cells in each extracted table, and which.max picks the table with the most.
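
The two selection steps can be tried without downloading a PDF. The sketch below is only an illustration: the page texts and tables are made-up stand-ins for what pdftools::pdf_text and tabulizer::extract_tables would return, and the helper names find_title_page and largest_table are hypothetical, not part of either package.

```r
# Made-up stand-in for pdftools::pdf_text(): one string per page.
pages <- c(
  "Telia Company Interim Report\nHighlights",
  "Condensed Consolidated Statements of\nComprehensive Income\nNet sales",
  "Condensed Consolidated Statements of Financial Position"
)

# Hypothetical helper: index of the first page matching the title, or NA.
find_title_page <- function(pages, pattern) {
  grep(pattern, pages, ignore.case = TRUE)[1]
}

# Hypothetical helper: the table (matrix) with the most cells in a list.
largest_table <- function(tables) {
  tables[[which.max(lengths(tables))]]
}

pattern <- "condensed consolidated statements.{0,10}comprehensive income"
p <- find_title_page(pages, pattern)

# Made-up stand-in for tabulizer::extract_tables(url, pages = p):
# a list of character matrices found on that page.
tables <- list(
  matrix("", nrow = 2, ncol = 2),
  matrix("", nrow = 12, ncol = 5),
  matrix("", nrow = 3, ncol = 4)
)
statement <- data.frame(largest_table(tables))
```

Here the title on the second page is split by a line break, which the .{0,10} gap absorbs, while the third page shares the first three words but not the rest of the title. If the largest table on the page turns out not to be the income statement in some report, the selection rule in largest_table is the part to adapt.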