R - Show data in a data frame


With the code below, I extract data from a PDF file using pdftools:

library(pdftools)
library(readr)

download.file("https://www.stoxx.com/document/Reports/SelectionList/2020/August/sl_sxebmp_202008.pdf","sl_sxebmp_202008.pdf", mode = "wb")
txt <- pdf_text("sl_sxebmp_202008.pdf")

txt <- read_lines(txt)

print(txt)

How can I get this data into a data.frame?


Solution

  • I would suggest using tabulizer on your file. extract_tables() reads all the tables in the PDF into a list, which you can then process. The first element of the list contains the variable names in its first row, so it is best to handle that element first:

    library(tabulizer)
    # Read every table in the PDF into a list (one element per table)
    lst <- extract_tables(file = 'sl_sxebmp_202008.pdf')
    # The first element holds the variable names in its first row,
    # so convert it separately and promote that row to column names
    d1 <- as.data.frame(lst[[1]], stringsAsFactors = FALSE)
    names(d1) <- unlist(d1[1, ])
    d1 <- d1[-1, ]
    # Convert the remaining elements to data frames and stack them
    lst2 <- lapply(lst[-1], as.data.frame, stringsAsFactors = FALSE)
    d2 <- do.call(rbind, lst2)
    # Give d2 the same column names as d1, then bind everything together
    names(d2) <- names(d1)
    d3 <- rbind(d1, d2)
    

    The first rows of the output d3 (1753 rows and 11 columns):

              ISIN   Sedol     RIC Int.Key Company Name Country Currency Component FF Mcap (BEUR)
    1 CH0038863350 7123870  NESN.S  461669       NESTLE      CH      CHF         Y          299.1
    2 CH0012032048 7110388   ROG.S  474577 ROCHE HLDG P      CH      CHF         Y          206.4
    3 CH0012005267 7103065  NOVN.S  477408     NOVARTIS      CH      CHF         Y          173.1
    4 DE0007164600 4846288 SAPG.DE  476361          SAP      DE      EUR         Y          146.4
    5 NL0010273215 B929F46 ASML.AS  546078    ASML HLDG      NL      EUR         Y          127.6
    6 GB0009895292 0989529   AZN.L  098952  ASTRAZENECA      GB      GBP         Y          124.2
      Rank\r(FINAL) Rank\r(PREVIO\rUS)
    1             1                  1
    2             2                  2
    3             3                  3
    4             4                  5
    5             5                  4
    6             6                  6
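
The last two column names in the output still contain `\r`, and every column comes out of the matrix conversion as text. A minimal cleanup sketch follows, assuming the `\r` sequences are real carriage-return characters left over from wrapped header cells. The small `d3` built here is only a stand-in with two rows and four columns copied from the output above, so the snippet runs without the PDF; `num_cols` is a name chosen for illustration.

```r
# Stand-in for the real d3: two rows copied from the output above,
# all values stored as character, with the raw header names.
d3 <- data.frame(
  a = c("CH0038863350", "DE0007164600"),
  b = c("299.1", "146.4"),
  c = c("1", "4"),
  d = c("1", "5"),
  stringsAsFactors = FALSE
)
names(d3) <- c("ISIN", "FF Mcap (BEUR)", "Rank\r(FINAL)", "Rank\r(PREVIO\rUS)")

# Drop the carriage returns from the column names
names(d3) <- gsub("\r", "", names(d3), fixed = TRUE)

# Convert the columns that hold numbers from character to numeric.
# Values that are not valid numbers would become NA with a warning.
num_cols <- c("FF Mcap (BEUR)", "Rank(FINAL)", "Rank(PREVIOUS)")
d3[num_cols] <- lapply(d3[num_cols], as.numeric)

str(d3)
```

On the real `d3` only the last two steps are needed. It is worth checking for `NA` values afterwards, since any cell that tabulizer split or merged incorrectly will not convert cleanly.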