
Scraping webpage when filter does not change URL


I would like to scrape http://csla.history.ox.ac.uk/search.php after applying a filter as follows:

  1. clicking on 'Saint'
  2. selecting 'Gaul and Frankish kingdoms' under 'Region of Birth/Burial'
  3. clicking on 'Apply Search'

I am struggling because the URL does not change when the filter is applied.

The source code containing <option value="Gaul">Gaul and Frankish kingdoms</option> looks as follows:

<div class="section colm colm6" id="fl-page4-12">
<label for="item_12" class="field-label">Region of Birth/Burial</label>
<label class="field select">
<select id="text-nine" name="form[item_89]">
<option value=""></option>
<option value="East">'The East' (unspecified)</option>
<option value="West">'The West' (unspecified)</option>
<option value="Britain">Britain and Ireland</option>
<option value="Gaul">Gaul and Frankish kingdoms</option>

From the resulting page, I would then like to extract the IDs that are written in blue; for example, the first one would be E06478.


Solution

  • This is a tricky one. The filter is submitted as a form rather than encoded in the URL, so you need to POST the query to the server, and the query needs to be in a very particular format. You can get the HTML from the page like this:

    library(httr)
    library(rvest)
    
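    # Form field numbers, in the order they are sent
    items <- c(998, 1, 18, 89, 90, 2, 88, 20, 3, 4, 5, 6, 12, 13, 11, 999, 213, 214)
    # One value per field: item_998 = "E", item_89 = "Gaul" (the region filter),
    # item_999 = "Or"; every other field is left empty
    contents <- c('\nE\n', '\n\n', '\n\n', '\nGaul\n', rep('\n\n', 11), '\nOr\n',
                  '\n\n', '\n\n')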
    s <- paste0("-----------------------------39565121210000504382566389445\n",
           "Content-Disposition: form-data; name=\"form[item_", items,
           ']\"\n', contents,
           collapse = '')
    s <- paste0(s, '-----------------------------39565121210000504382566389445--')
    
    type <- paste0('multipart/form-data; boundary=---------------------------',
                   '39565121210000504382566389445')
    
    res <- POST('http://csla.history.ox.ac.uk/results.php',
               body = charToRaw(s),
               content_type(type))
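
    The string-building step can be easier to follow when wrapped in a small function. The following is only a sketch: build_multipart_body and the boundary "XYZ123" are hypothetical names chosen for illustration, and it mirrors the layout used above (plain newlines, whereas the multipart standard specifies CRLF line endings), so whether a given server accepts it is an assumption to verify.

    ```r
    # Hypothetical helper: assemble a multipart/form-data body from parallel
    # vectors of field numbers and values, using the same layout as the code above.
    build_multipart_body <- function(items, values, boundary) {
      parts <- paste0(
        "--", boundary, "\n",
        "Content-Disposition: form-data; name=\"form[item_", items, "]\"\n",
        "\n", values, "\n",          # blank line, then the field value
        collapse = ""
      )
      # The closing delimiter is the boundary with "--" on both sides
      paste0(parts, "--", boundary, "--")
    }

    # Illustrative boundary only; any string not occurring in the values would do
    example_body <- build_multipart_body(c(998, 89), c("E", "Gaul"), "XYZ123")
    cat(example_body)
    ```

    The body returned by this helper would be passed to POST() in the same way as s above, with a content type that names the same boundary.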
    

    To get all the results in a neat data frame, you can then do:

    df <- res %>% 
      read_html() %>% 
      html_elements(xpath = "//td[not(contains(@style, 'LightGray'))]") %>% 
      html_text() %>% 
      matrix(ncol = 2, byrow = TRUE) %>% 
      as.data.frame() %>% 
      setNames(c('ID', 'Title')) %>% 
      dplyr::as_tibble()
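
    To see what the XPath filter and the matrix reshape are doing, here is a minimal offline sketch. The HTML is a made-up stand-in for the results page (the IDs and titles are invented), assuming header cells carry a LightGray style and each result row has two cells.

    ```r
    library(rvest)

    # Hypothetical miniature of the results table; not the real page
    toy_page <- minimal_html('
      <table>
        <tr><td style="background-color:LightGray">ID</td>
            <td style="background-color:LightGray">Title</td></tr>
        <tr><td>E00001</td><td>First record</td></tr>
        <tr><td>E00002</td><td>Second record</td></tr>
      </table>')

    # Keep only the cells whose style does not mention LightGray (i.e. drop headers)
    cells <- toy_page %>%
      html_elements(xpath = "//td[not(contains(@style, 'LightGray'))]") %>%
      html_text()

    # Cells arrive as a flat vector; fill a two-column matrix row by row
    toy_df <- as.data.frame(matrix(cells, ncol = 2, byrow = TRUE),
                            stringsAsFactors = FALSE)
    names(toy_df) <- c("ID", "Title")
    toy_df
    ```

    The reshape only works if the number of remaining cells is a multiple of the column count, which is why the header cells have to be filtered out first.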
    

    This gets you all the reference IDs in a data frame. To get the actual pages, we use these as query strings:

    urls <- paste0("http://csla.history.ox.ac.uk/record.php?recid=", df$ID)
    

    Now we need to go through all 900+ pages to extract the tabular data. It's safest to do this in a loop and then bind the list together at the end:

    all_results <- list()
    
    for(i in seq_along(urls)) {
      all_results[[i]] <- read_html(urls[i]) %>% 
                           html_elements("td") %>% 
                           html_text() %>%
                           matrix(ncol = 4, byrow = TRUE) %>%
                           as.data.frame() %>%
                           setNames(c("ID", "Name", "Name_in_source", "Identity"))
    }
    
    final_result <- dplyr::bind_rows(all_results)
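
    With this many requests, a single failed download or an oddly shaped page would stop the loop. One possible variation, sketched here with hypothetical helper names (parse_record, safe_fetch) and an arbitrary half-second pause, skips problem pages instead of aborting:

    ```r
    library(rvest)

    # Hypothetical helper: turn one parsed record page into a four-column data
    # frame, or NULL if the cell count does not fit the expected layout.
    parse_record <- function(page) {
      cells <- page %>% html_elements("td") %>% html_text()
      if (length(cells) == 0 || length(cells) %% 4 != 0) return(NULL)
      out <- as.data.frame(matrix(cells, ncol = 4, byrow = TRUE),
                           stringsAsFactors = FALSE)
      setNames(out, c("ID", "Name", "Name_in_source", "Identity"))
    }

    # Hypothetical wrapper: pause between requests and return NULL on any error
    safe_fetch <- function(url, delay = 0.5) {
      Sys.sleep(delay)
      tryCatch(
        parse_record(read_html(url)),
        error = function(e) {
          warning("Skipping ", url, ": ", conditionMessage(e))
          NULL
        }
      )
    }

    # Usage with the urls vector from above (not run here):
    # all_results  <- lapply(urls, safe_fetch)
    # final_result <- dplyr::bind_rows(all_results)
    ```

    dplyr::bind_rows() ignores NULL list elements, so skipped pages simply drop out of the final data frame.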
    

    The final result is now a data frame with over 3000 rows. Here are the first 3:

    head(final_result, 3)
    #>       ID                                       Name Name_in_source Identity
    #> 1 S01319          Orientius, bishop of Auch, 5th c.                 Certain
    #> 2 S02351 Mamertus, bishop of Vienne (Gaul), ob. 475                 Certain
    #> 3 S00316                            Martyrs of Lyon                 Certain
    

    Some of the rows are duplicates, since the same ID can appear on multiple pages. You could use unique() to remove these. Note also that on some systems, printing a data frame to the console may show Greek letters as Unicode escape sequences. The text is still there in the underlying vector though. For example:

    head(final_result[3])
    #>                                                       Name_in_source
    #> 1                                                                   
    #> 2                                                                   
    #> 3                                                                   
    #> 4                                                                   
    #> 5 <U+03A0><U+03BF><U+03BB><U+03CD><U+03BA>a<U+03C1>p<U+03BF><U+03C2>
    #> 6           <U+03A0><U+03B9><U+03CC><U+03BD><U+03B9><U+03BF><U+03C2>
    

    But indexing the column directly shows the original text:

    final_result[1:6, 3]
    #> [1] ""          ""          ""          ""          "Πολύκαρπος" "Πιόνιος"
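
    Removing the duplicates mentioned above can be sketched on a small made-up data frame. The IDs below are invented for illustration; the Greek names are written as \u escapes so the snippet does not depend on the file encoding.

    ```r
    # Toy data with one repeated row; IDs are hypothetical
    polykarpos <- "\u03a0\u03bf\u03bb\u03cd\u03ba\u03b1\u03c1\u03c0\u03bf\u03c2"
    pionios    <- "\u03a0\u03b9\u03cc\u03bd\u03b9\u03bf\u03c2"

    toy <- data.frame(
      ID = c("S90001", "S90001", "S90002"),
      Name_in_source = c(polykarpos, polykarpos, pionios),
      stringsAsFactors = FALSE
    )

    # unique() on a data frame drops rows that are identical in every column
    deduped <- unique(toy)

    # The strings keep their real characters regardless of how the console prints them
    nchar(deduped$Name_in_source)
    ```

    If only the ID column should decide what counts as a duplicate, toy[!duplicated(toy$ID), ] is an alternative.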