Search code examples
rdplyrmapplypdftools

How to mutate a large number of columns onto a data frame at once in R using a custom function with pdftools and html links?


Sorry if this is long or not structured correctly, its my first question and first major R side project! Let me know if I should change anything about my questions for the future.

I am currently working with some city traffic data that is stored rather oddly. Rather than storing the data for each intersection within the downloadable csv files, a link to a website containing html links to all previous traffic surveys in a PDF form is present.

I have extracted the most recent PDF for each traffic intersection, but am having trouble getting my function to read in a PDF and return the data to work when input to mapply and mutate. I previously wrote a function capable of taking in the format of the PDF and the PDF link as input and returning a 1 row, 55 column data frame containing all of the traffic data for that intersection.

Now, I can't seem to get the function to work in mapply/with mutate. The function is below, but it takes two inputs, the type of survey used for the PDF and the PDF link to the PDF in question, and returns the data frame mentioned above.

When simply using mapply by itself, I think that the function uses the entire survey type column / entire PDF link column as a whole for the variables in the function, rather than looping over the entire columns.

When using mapply inside of mutate, it appears to loop correctly over the two columns, but I am unsure of how to use mutate to correctly add a large number of columns at once. Ideally, I would simply make a vector of the names of the columns in the proper order and use mutate or some mutate-like function to assign the results of mapply to the list of columns, i.e.

mutate-like(traffic_westwood, col-names = mapply(get_data, SURVEY_TYPE, PDF))

I thus have two questions:

  1. How does mapply work in relation to the code I have below? Am I misunderstanding how mapply loops over the two columns, or is my function not properly vectorized? Should I use a different looping function than mapply for this task?

  2. Is there a function I can use to assign the results of looping over the two columns of interest and extracting the PDF data to a list of column names and then add these columns to my data frame?

Previously, I tried putting the mapply call in mutate, as discussed above, but it still requires a way to assign the 55 columns to a list of names.

Additionally, I switched to mapply rather than lapply when I realized it was easier to loop over the two columns I needed rather than try and perform my custom function on the entire rows.

Please also note that the SURVEY_TYPE columns and PDF columns used in the mapply call are both the correct dimension with acceptable values in every slot: I have already checked this.

Please note that my code would ideally work with any input finite data frame of any dimensions containing the following two columns, SURVEY_TYPE and PDF. SURVEY_TYPE contains 'Auto' to indicate an Automatic survey type and 'Manual' to indicate a manual survey type. PDF contains a vector of strings/PDF html links for a row where SURVEY_TYPE = 'Auto' and a single string/PDF html link if SURVEY_TYPE = 'Manual' for that row.

For examples of an LA City manual traffic survey PDF link, please examine the error message of closed connections printed below, which should contain several such links.

#This function obtains a data frame with rows of 55 entries for each intersection with manual data.
#Automatic data only intersections have blank rows for now.
get_data <- function(survey_type, pdf_link){

  #Return data frame of empty row if type was auto. I'm fine on this part.
  if(survey_type == 'Auto'){
    ...
    return(data.frame(#Whatever I want for auto, empty 1 row 55 cols for now))
  }

  #Now we read in this entries manual pdf data.

  #Read in the first pdf.
  pdf1 <- pdf_text(toString(pdf_link))


  #Get the handle for this file..
  file1 <- file("Traffic_Data_Files/pdf1.txt", 'w')

  #Write the PDF contents to the file.
  write(pdf1, file = file1, sep = '\t')

  #Close the file and reopen it.
  close(file1)
  file1 <- file("Traffic_Data_Files/pdf1.txt", 'r')

  #Here is where I had code to get the data frame we will return and extract the 
  #data to it. Note that #the data frame will be 1 row and 55 columns and will  
  #be called 'intersection1'.
  #I have tested this code, which works on a single file to return such data.

  #Return the info for this intersection.
  return(intersection1)

}

#Use lapply on the traffic data.
manual_intersections <- mutate(traffic_westwood, Data = mapply(get_data, SURVEY_TYPE, PDF))

Ideally, I would like to directly append the 55 rows of filled in information onto the current data frame. Of course, I could just get a new data frame with the rows and append it to the old one if necessary as an intermediate step.

The following error message occurs when I keep the mapply call outside of the mutate call on the last line above (to me it appears to be that it can't tell the columns are vectors and does the first check for SURVEY_TYPE == 'Auto' on the first component of the column rather than element-wise):

Error in open.connection(con, "rb") : cannot open the connection
In addition: Warning messages:
1: In if (survey_type == "Auto") { :
  the condition has length > 1 and only the first element will be used
2: In open.connection(con, "rb") :
  cannot open URL 'http://navigatela.lacity.org/dot/traffic_data/manual_counts/46972_WESWEY96.pdf, http://navigatela.lacity.org/dot/traffic_data/manual_counts/46974_KINWES96.pdf, http://navigatela.lacity.org/dot/traffic_data/manual_counts/Gayley.Weyburn.180927-NDSMAN.pdf, http://navigatela.lacity.org/dot/traffic_data/manual_counts/12599_GAYVET96.pdf, http://navigatela.lacity.org/dot/traffic_data/manual_counts/gaylec06.pdf, http://navigatela.lacity.org/dot/traffic_data/manual_counts/12690_BROLEC95.pdf, http://navigatela.lacity.org/dot/traffic_data/manual_counts/12691_BROXTON.WEYBURN07.pdf, http://navigatela.lacity.org/dot/traffic_data/manual_counts/12995_HILLEC95.pdf, http://navigatela.lacity.org/dot/traffic_data/manual_counts/13531_LECWES96.pdf, http://navigatela.lacity.org/dot/traffic_data/manual_counts/13997_VETWIL081021.pdf, http://navigatela.lacity.org/dot/traffic_data/manual_counts/SELWIL080319.pdf, http://navigatela.lacity.org/dot/traffic_data/manual_counts/16896_GLELIN96.pdf, http [... truncated]

When I call mapply in mutate as shown in the code and attempt to assign the call to a single variable, on first glance I believe that it cannot assign the whole resulting row to a single column entry. Thus my question on how I could assign a large number of columns to a list of names using an arbitrary custom function in mutate or a mutate like function:

Error in eval(substitute(expr), envir, enclos) : 
  more columns than column names
In addition: Warning message:
closing unused connection 3 (http://navigatela.lacity.org/dot/traffic_data/manual_counts/46972_WESWEY96.pdf, http://navigatela.lacity.org/dot/traffic_data/manual_counts/46974_KINWES96.pdf, http://navigatela.lacity.org/dot/traffic_data/manual_counts/Gayley.Weyburn.180927-NDSMAN.pdf, http://navigatela.lacity.org/dot/traffic_data/manual_counts/12599_GAYVET96.pdf, http://navigatela.lacity.org/dot/traffic_data/manual_counts/gaylec06.pdf, http://navigatela.lacity.org/dot/traffic_data/manual_counts/12690_BROLEC95.pdf, http://navigatela.lacity.org/dot/traffic_data/manual_counts/12691_BROXTON.WEYBURN07.pdf, http://navigatela.lacity.org/dot/traffic_data/manual_counts/12995_HILLEC95.pdf, http://navigatela.lacity.org/dot/traffic_data/manual_counts/13531_LECWES96.pdf, http://navigatela.lacity.org/dot/traffic_data/manual_counts/13997_VETWIL081021.pdf, http://navigatela.lacity.org/dot/traffic_data/manual_counts/SELWIL080319.pdf, http://navigatela.lacity.org/dot/traffic_data/manual_counts/16896_GLELIN [... truncated]
'''

EDIT: Further examination of my code for creating a minimal example revealed other problems. I will update when I am farther along

Solution

  • I figured out why my code wasn't working. It turns out that the PDF's, although written in the same format, are read in by PDFTools differently, as there are several distinct spacing differences between the seemingly identical table/title formats. This is what causes the error message for the second part of the problem, as the number of rows I assumed in the function to read in the csv from the file is incorrect for these differently formatted PDFs.

    Thus, I need to determine which PDFs my function fails on, create a second or possibly more types of manual surveys, and adapt the function that reads in data from PDFS to work on all different types of manual surveys. Thanks for the suggestion from the comment of finding a minimum reproducible dataset: the process of running the function on toy datasets rather than the full one led me to discover this.