Search code examples
rdplyrpdf-scraping

converting plain text to data frame using dplyr in r


I'm trying to use r convert plain text scraped from a pdf with pdftools and tidyverse into a data frame. I'm hoping for a solution using tidyverse packages. I've used the following code to get to a list of strings with my essential information:

library(tidyverse)
library(pdftools)

textdf <- pdf_text("raw pdf.pdf")

all_stats_lines <- textdf[3:28]%>%
  str_squish()%>%
  str_replace_all(",", "")%>%
  str_remove_all("\\+80% \\+80% \\+80% \\+40% \\+40% \\+40% Baseline Baseline Baseline \\-40% \\-40% 
  \\-40% \\-80% \\-80% \\-80% Sun Feb 16 Sun Mar 8 Sun Mar 29 Sun Feb 16 Sun Mar 8 Sun Mar 29 Sun Feb 
  16 Sun Mar 8 Sun Mar 29")%>% 
  str_remove_all("compared to baseline")%>%
  strsplit("   ")



This yields the following list of 26 lists of strings in the following format:


[[1]]
[1] "Alaska Variable 1 Variable 2 Variable 3 42  15  5" 
[2] "Variable 4 Variable 5 Variable 6 43  30  11"              
[3] "Alabama Variable 1 Variable 2 Variable 3 27  9  79"
[4] "Variable 4  Variable 5 Variable 6 20  23  4  "          

[[2]]
[1] "Arizona Variable 1 Variable 2 Variable 3 40  17  7" 
[2] "Variable 4 Variable 5 Variable 6 41  33  10"              
[3] "Arkansas Variable 1 Variable 2 Variable 3 29  7  81"
[4] "Variable 4  Variable 5 Variable 6 22  27  7  "   

... etc.

Note the state names at the beginning of sub list rows 1 and 3 and spaces within variable names. Each state should be one row. There should be 6 columns Variable 1 Variable 2 Variable 3 Variable 4 Variable 5 Variable 6 with the corresponding values in order.

Any solution for how to build this table?


Solution

  • You have several problems you need to solve in order to do this.

    1. splitting text to columns
    2. combining pairs of rows
    3. combining lists

    If you are only going to do this once, it would probably be simpler to copy your data into Excel, format it there and then load it into R. But if you are committed to using R, then we need to work through each of these in order:

    1) Text to columns

    strsplit(string, " ") will split text strings on spaces. strsplit(string, "[[:space:]]+") will split on white space, treating consecutive white space as a single split. However you want to avoid separating "Variable 1" into "Variable" and "1", and separating "North Dakota" into "North" and "Dakota"

    A quick solution here is gsub("North ", "North_", string) which will turn "North Dakota" into "North_Dakota" so the two words remain together when you apply strsplit.

    2) Combining pairs of rows

    You can use modulo to extract every second entry of a vector: 1:4 %% 2 will return c(1,0,1,0). This can be used to extract every second value like so: vec[1:4 %% 2 == 1].

    Combining these in pairs gives you something like the following:

    vec = c('a', 'b', 'c', 'd')
    paste(vec[1:4 %% 2 == 1], vec[1:4 %% 2 == 0])
    

    3) Combining lists

    The easiest solution here is probably unlist. But you could also use the reduce function in the purrr package.

    Combining it all

    data = unlist(data)
    data = trimws(data)
    nn = length(data)
    data = paste(data[1:nn %% 2 == 1], data[1:nn %% 2 == 0])
    
    # add other rules here for state names that are two words
    data = gsub("ariable ", "ariable_", data)
    data = gsub("North ", "North_", data)
    
    data %>%
      strsplit("[[:space:]]+") %>%
      purrr::reduce(rbind) %>%
      as.data.frame()