Tags: r, pdf, web-scraping, pdf-scraping

Why can't I clean pdf table and rename columns as a function?


I figured out how to scrape this PDF, but I have a lot of these files to go through. My plan is to wrap the steps in a function, import the data from all of the PDFs (one PDF per month for several years), and then use rbind() to build one data table that I can write out as a CSV.

This works.

library(tidyverse)
library(tabulizer)

#import the data
jan16s_raw <- extract_tables("https://www.nvsos.gov/sos/home/showdocument?id=4062")

#create data frame
cleanNvsen <- do.call(rbind, jan16s_raw)
cleanNvsen2 <-as.data.frame(cleanNvsen[3:nrow(cleanNvsen),])

#rename all of the columns
names(cleanNvsen2)[1] <- "District"
names(cleanNvsen2)[2] <- "Democrat"
names(cleanNvsen2)[3] <- "Independent American"
names(cleanNvsen2)[4] <- "Libertarian"
names(cleanNvsen2)[5] <- "Nonpartisan"
names(cleanNvsen2)[6] <- "Other"
names(cleanNvsen2)[7] <- "Republican"
names(cleanNvsen2)[8] <- "Total"

#check to see if it worked
head(cleanNvsen2)

But wrapping the same steps in a function returns a single value rather than the full table:

library(tidyverse)
library(tabulizer)

#load data
jan16s_raw <- extract_tables("https://www.nvsos.gov/sos/home/showdocument?id=4062")

#create function to create data frame and then rename 
clean <- function(x) {
cleanNvsen <- do.call(rbind, x)
cleanNvsen2 <-as.data.frame(cleanNvsen[3:nrow(cleanNvsen),])

names(cleanNvsen2)[1] <- "District"
names(cleanNvsen2)[2] <- "Democrat"
names(cleanNvsen2)[3] <- "Independent American"
names(cleanNvsen2)[4] <- "Libertarian"
names(cleanNvsen2)[5] <- "Nonpartisan"
names(cleanNvsen2)[6] <- "Other"
names(cleanNvsen2)[7] <- "Republican"
names(cleanNvsen2)[8] <- "Total"
}

x2 <- clean(jan16s_raw)

head(x2)

I'd really like to get this to work so that I can just feed R the URLs and run the clean function I've created. I have dozens of files to go through.


Solution

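  • The function in the question never returns the data frame. Its last expression is an assignment, so the function returns the assigned value ("Total") instead of cleanNvsen2. Write clean so that it extracts the tables, renames the columns, and returns the result explicitly. The columns can also be renamed all at once rather than one at a time.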

    clean <- function(url) {
      jan16s_raw <- extract_tables(url)
      #create data frame
      cleanNvsen <- do.call(rbind, jan16s_raw)
      cleanNvsen2 <- as.data.frame(cleanNvsen[3:nrow(cleanNvsen),])
      #rename all of the columns
      names(cleanNvsen2) <- c("District", "Democrat", "Independent American", 
                      "Libertarian","Nonpartisan","Other","Republican","Total")
    
      return(cleanNvsen2)
    }
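
    The cause of the failure can be reproduced without any PDF. The following is a minimal sketch on a made-up two-column data frame (the names df, rename_no_return, and rename_with_return are illustrative and not from the question). An R function returns the value of its last evaluated expression, and the value of an assignment such as names(d)[2] <- "Total" is the right-hand side, not the modified data frame.

    ```r
    # Toy data frame standing in for the scraped table (hypothetical data)
    df <- data.frame(V1 = c("A", "B"), V2 = c("1", "2"))

    # The last expression is an assignment, so the function returns "Total"
    rename_no_return <- function(d) {
      names(d)[1] <- "District"
      names(d)[2] <- "Total"
    }

    # Ending with the object itself returns the renamed data frame
    rename_with_return <- function(d) {
      names(d)[1] <- "District"
      names(d)[2] <- "Total"
      d
    }

    bad  <- rename_no_return(df)    # a length-1 character vector
    good <- rename_with_return(df)  # a 2 x 2 data frame
    ```

    Ending the function with d or with return(d) is equivalent here; the explicit return() used in the answer simply makes the intent easier to see.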
    

    Create a vector of all the URLs from which you want to extract the data.

    list_of_urls <- c('https://www.nvsos.gov/sos/home/showdocument?id=4062', 
                      'https://www.nvsos.gov/sos/home/showdocument?id=4064')
    

    Then call the clean function on each URL and combine the results.

    all_data <- purrr::map_df(list_of_urls, clean)
    #OR
    #all_data <- do.call(rbind, lapply(list_of_urls, clean))
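
    With dozens of files, one unreachable or oddly formatted PDF would abort the whole run, so it may help to make the loop more forgiving before writing the CSV the question asks for. Below is a minimal base R sketch, assuming the clean function and list_of_urls defined above. The helper combine_tables, the source_url column, and the output file name are hypothetical choices, not part of the original answer.

    ```r
    # Hypothetical helper: apply `cleaner` to each URL, skip failures, stack the rest
    combine_tables <- function(urls, cleaner) {
      pieces <- lapply(urls, function(u) {
        tryCatch({
          out <- cleaner(u)
          out$source_url <- u  # record which file each row came from
          out
        }, error = function(e) {
          warning("Skipping ", u, ": ", conditionMessage(e))
          NULL                 # mark this URL as failed
        })
      })
      pieces <- Filter(Negate(is.null), pieces)  # drop the failed URLs
      do.call(rbind, pieces)
    }

    # Usage with the objects defined above (needs tabulizer and network access):
    # all_data <- combine_tables(list_of_urls, clean)
    # write.csv(all_data, "all_data.csv", row.names = FALSE)
    ```

    Passing the cleaning function as an argument keeps the helper independent of tabulizer, so it can be tried out on a stand-in function before pointing it at the real URLs.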