Search code examples
rrjavapdf-scraping

Creating columns from scraped pdf with cuts on spaces


I'm trying to create a data frame from the following PDF

library(tabulizer)
url <- "https://doccs.ny.gov/system/files/documents/2020/06/doccs-covid-19-confirmed-by-facility-6.30.2020.pdf"
tab1 <- extract_tables(url)

However, when I call tab1 it only has one column:

      [,1]                                                                     
 [1,] "NYS DOCCS INCARCERATED INDIVIDUALS COVID-19 REPORT BY REPORTED FACILITY"
 [2,] "AS OF JUNE 29, 2020 AT 3:00 PM"                                         
 [3,] "POSITIVE CASE STATUS OTHER TESTS"                                       
 [4,] "TOTAL"                                                                  
 [5,] "FACILITY RECOVERED DECEASED POSITIVE PENDING NEGATIVE"                  
 [6,] "TOTAL 495 16 519 97 805"                                                
 [7,] "ADIRONDACK 0 0 0 75 0"                                                  
 [8,] "ALBION 0 0 0 0 2"                                                       
 [9,] "ALTONA 0 0 0 0 1"  

                                                 

I would like to extract what should be the individual columns to create a dataframe (e.g. for row 7 I extract its contents into the following columns: Facility ("Adirondack") Recovered (0) Decesased (0) Positive (0) Pending (75) Negative (0) ). I'm thinking that the most efficient way to do this would be to make cuts in tab1 based on spaces, but this doesn't work since some of the facilities have multiple words in them, so the space cut would get messed up. Does anyone have an idea for a solution? Thanks for the help!


Solution

  • Here is a workaround:

    library(tabulizer)
    
    url <- "https://doccs.ny.gov/system/files/documents/2020/06/doccs-covid-19-confirmed-by-facility-6.30.2020.pdf"
    tab1 <- extract_tables(url)
    
    plouf <- tab1[[1]][6:dim(tab1[[1]])[1],] 
    plouf <- gsub("([A-Z]+) ([A-Z]+)","\\1_\\2",plouf)
    df <- read.table(text = paste0(t(plouf) ,collapse = "\n\r"),sep = " ")
    names(df) <- strsplit(tab1[[1]][5,]," ")[[1]]
    
               FACILITY RECOVERED DECEASED POSITIVE PENDING NEGATIVE
    1             TOTAL       495       16      519      97      805
    2        ADIRONDACK         0        0        0      75        0
    3            ALBION         0        0        0       0        2
    4            ALTONA         0        0        0       0        1
    5            ATTICA         2        0        2       1        7
    6            AUBURN         0        0        0       0       10
    7         BARE_HILL         0        0        0       0        6
    8     BEDFORD_HILLS        43        1       44       5       53
    9      CAPE_VINCENT         0        0        0       0        0
    10           CAYUGA         0        0        0       2        1
    11          CLINTON         1        0        1       0       25
    12          COLLINS         1        0        1       0       13
    13        COXSACKIE         1        0        1       0       57
    14        DOWNSTATE         1        0        1       0       12
    15          EASTERN        17        1       20       0       17
    16        EDGECOMBE         0        0        0       0        0
    17           ELMIRA         0        0        0       1       20
    18         FISHKILL        78        5       83       4       98
    19      FIVE_POINTS         0        0        0       0        4
    20         FRANKLIN         1        0        1       0       24
    

    I take the table after the title, then remove the spaces between the FACILITY names with gsub (I actually replace them with _, so you can rechange to space after if you want. You can also use str_replace from stringr instead of gsub).

    I then use read.table, forcing the text with a end of line after each line. I add the name after (because if not, they get changed in the gsub and read.table do not read them properly).