rvest to scrape images


I've worked on this for a couple of weeks without success. My long-term goal is to scrape every image from the following website: https://bioguide.congress.gov/search. For starters, I'm trying to get the location of just one image, which is stored in the alt (and src) attribute of the img element in the HTML.

The HTML for one card looks like this:

<div class="l-grid__item l-grid__item--3/12 l-grid__item--12/12@mobile--sm l-grid__item--4/12@desktop l-grid__item--6/12@tablet"><div tabindex="0" class="c-card u-flex u-flex--column u-height--100% u-cursor--pointer u-bxs--dark-lg:hover c-card--@print"><div class="u-height--100% u-width--100% u-p u-flex u-flex--centered u-mb--auto"><div aria-hidden="true" class="u-max-width--80% u-max-height--250px"><img alt="/photo/66c88d1d7401a93215e0b225.jpg" class="u-max-height--250px u-height--auto u-width--auto u-block" src="/photo/66c88d1d7401a93215e0b225.jpg"></div></div><div class="u-flex u-flex--column u-flex--no-shrink u-p u-bg--off-white u-fw--bold u-color--primary u-text--center u-bt--light-gray"><div class="u-cursor--pointer u-mb--xs">AANDAHL, Fred George</div><div class="u-fz--sm u-fw--semibold">1897 – 1966</div></div></div></div>

I used the following R code, but it returns character(0):

library(httr)
library(rvest)

# Fetch the HTML content with a custom User-Agent
response <- GET("https://bioguide.congress.gov/search", 
                user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36"))

# Parse the content
page <- read_html(content(response, as = "text", encoding = "UTF-8"))

# Navigate to the div with class starting with 'l-grid__item' and extract img alt attributes
img_alt_values <- page %>%
  html_nodes(xpath = "//div[starts-with(@class, 'l-grid__item')]") %>%
  html_nodes(xpath = ".//img") %>%
  html_attr("alt")

Can anyone suggest how I get past this?


Solution

  • Looking at the network traffic shows that the data comes from an API: the page's search function sends a POST request with a JSON payload. This is presumably also why rvest returns character(0): the cards are filled in by JavaScript after the page loads, so the HTML fetched by GET() contains no img elements to select. We can use httr2 to make the same requests and retrieve up to 100 records at a time. To keep the example small, the code below requests only 3 records at a time.
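
To see that the selector itself was not the problem, here is a minimal sketch that runs the same XPath against a shortened copy of the card HTML from the question (the snippet is trimmed by hand, and `snippet` and `img_src` are names made up for this illustration). It only shows that rvest finds the image once the markup is actually present:

```r
library(rvest)

# Shortened copy of the card markup from the question (assumption: trimmed by hand)
snippet <- '<div class="l-grid__item l-grid__item--3/12"><div class="c-card"><img alt="/photo/66c88d1d7401a93215e0b225.jpg" src="/photo/66c88d1d7401a93215e0b225.jpg"></div></div>'

# Wrap the fragment in a minimal HTML document and apply the question's XPath
img_src <- minimal_html(snippet) |>
  html_elements(xpath = "//div[starts-with(@class, 'l-grid__item')]//img") |>
  html_attr("src")

img_src
# [1] "/photo/66c88d1d7401a93215e0b225.jpg"
```

The same call on the page fetched with GET() finds nothing because that markup is not in the response body.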

    The URL and payload are:

    library(httr2)
    library(jsonlite)
    library(tidyverse)
    
    # API address
    url <- "https://app-elastic-prod-eus2-001.azurewebsites.net/search"
    
    # JSON payload  
    payload_string <- r"({
        "index": "bioguideprofiles",
        "aggregations": [
            {
                "field": "jobPositions.congressAffiliation.congress.name",
                "subFields": [
                    "jobPositions.congressAffiliation.congress.startDate",
                    "jobPositions.congressAffiliation.congress.endDate"
                ]
            },
            {
                "field": "jobPositions.congressAffiliation.partyAffiliation.party.name"
            },
            {
                "field": "jobPositions.job.name"
            },
            {
                "field": "jobPositions.congressAffiliation.represents.regionCode"
            }
        ],
        "size": 12,
        "from": 0,
        "sort": [
            {
                "_score": true
            },
            {
                "field": "unaccentedFamilyName",
                "order": "asc"
            },
            {
                "field": "unaccentedGivenName",
                "order": "asc"
            },
            {
                "field": "unaccentedMiddleName",
                "order": "asc"
            }
        ],
        "keyword": "",
        "filters": {
    
        },
        "matches": [
    
        ],
        "searchType": "OR",
        "applicationName": "bioguide.house.gov"
    }
    )"
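
Before sending anything, it can help to confirm that a hand-edited payload string is still valid JSON. The sketch below uses an abbreviated stand-in for the payload (only three of its fields, and `payload_demo` is a name invented here), so treat it as an illustration rather than the real request body:

```r
library(jsonlite)

# Abbreviated stand-in for the full payload (assumption: three fields only)
payload_demo <- r"({"index": "bioguideprofiles", "size": 12, "from": 0})"

# validate() reports whether the string parses as JSON
is_valid <- validate(payload_demo)

# fromJSON() turns the string into a named R list
payload_demo_list <- fromJSON(payload_demo)
names(payload_demo_list)
# [1] "index" "size"  "from"
```

Note that the r"(...)" raw string syntax needs R 4.0 or later.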
    

    We convert the payload to an R list so that the from and size fields can be changed for each request with req_body_json_modify():

    # Convert to R list
    payload_list <- fromJSON(payload_string)
    
    # Fetch the first total_records records, request_size records per request
    request_size <- 3L         # 100 max per request
    total_records <- 15L       # 12953 records in database
    from <- seq(1L, total_records, request_size) - 1L  # Sequence of starting positions
    
    # Generate base request
    req <- request(url) |>
        req_method("POST") |>
        req_body_json(payload_list) 
    
    # Generate list of requests (5 requests of 3 records each)
    requests <- from |> 
       lapply(\(n) req |> req_body_json_modify(from = n, size = request_size))
    
    # Execute requests
    responses <- req_perform_sequential(requests, on_error = "return")
    
    # Parse responses and extract image URL
    results <- resps_data(
      responses,
      \(r) r |>
        resp_body_json(simplifyDataFrame = TRUE) |>
        pluck("filteredHits")  |>
        select(starts_with("unaccented"), any_of("image"))
      ) |>
      bind_rows() |>
      hoist("image", "contentUrl") |> 
      select(-image) |> 
      mutate(image_url = ifelse(is.na(contentUrl), NA, paste0("https://bioguide.congress.gov/photo/", basename(contentUrl))), .keep = "unused") |> 
      as_tibble()
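
To scale this up to the whole database, only the sequence of starting positions needs to change. Here is a small sketch of that arithmetic in base R, using the figures quoted in the code comments (12953 records, 100 per request); `page_offsets` is a hypothetical helper, not part of httr2:

```r
# Hypothetical helper: zero-based starting position of each request
page_offsets <- function(total_records, request_size) {
  seq(0L, total_records - 1L, by = request_size)
}

# Same offsets as in the example above
small <- page_offsets(15L, 3L)
small
# [1]  0  3  6  9 12

# Full database, assuming the record count quoted above is still accurate
all_offsets <- page_offsets(12953L, 100L)
length(all_offsets)    # 130 requests
tail(all_offsets, 1)   # last request starts at 12900
```

The record count may change over time, so it is probably safer to read it from a response than to hard-code it.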
    

    Where results contains the derived image URLs:

    # A tibble: 15 × 4
       unaccentedFamilyName unaccentedGivenName unaccentedMiddleName image_url                                       
       <chr>                <chr>               <chr>                <chr>                                           
     1 Aandahl              Fred                George               https://bioguide.congress.gov/photo/66c88d1d740…
     2 Abbitt               Watkins             Moorman              https://bioguide.congress.gov/photo/ad79716f164…
     3 Abbot                Joel                NA                   NA                                              
     4 Abbott               Amos                NA                   NA                                              
     5 Abbott               Joseph              Carter               https://bioguide.congress.gov/photo/39253c461f2…
     6 Abbott               Joseph              NA                   https://bioguide.congress.gov/photo/43ba0fd5299…
     7 Abbott               Josiah              Gardner              https://bioguide.congress.gov/photo/470dc5df4ba…
     8 Abbott               Nehemiah            NA                   NA                                              
     9 Abdnor               James               NA                   https://bioguide.congress.gov/photo/a32ba2ea44f…
    10 Abel                 Hazel               Hempel               https://bioguide.congress.gov/photo/07f3a896ce1…
    11 Abele                Homer               E.                   https://bioguide.congress.gov/photo/a58aa67c32f…
    12 Abercrombie          James               NA                   NA                                              
    13 Abercrombie          John                William              https://bioguide.congress.gov/photo/76a90e5795f…
    14 Abercrombie          Neil                NA                   https://bioguide.congress.gov/photo/66cbb14989f…
    15 Abernethy            Charles             Laban                https://bioguide.congress.gov/photo/00ff9ca93d0…
    

    Each query returns a lot of other data as well, but I'll leave wrangling the rest of it to you.
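
Since the end goal is to save the images themselves, here is a rough sketch of a download step using base R only. `photo_path` and `download_photos` are hypothetical helpers written for this answer, and the one-second pause is an arbitrary courtesy delay. Nothing is downloaded until the function is called:

```r
# Hypothetical helper: local file path for an image URL
photo_path <- function(url, dir = "photos") {
  file.path(dir, basename(url))
}

# Hypothetical helper: download every non-missing URL, skipping files already on disk
download_photos <- function(urls, dir = "photos", pause = 1) {
  urls <- unique(urls[!is.na(urls)])
  dir.create(dir, showWarnings = FALSE, recursive = TRUE)
  for (u in urls) {
    dest <- photo_path(u, dir)
    if (!file.exists(dest)) {
      # mode = "wb" keeps binary files intact on Windows;
      # try() lets the loop continue if one download fails
      try(download.file(u, dest, mode = "wb", quiet = TRUE))
      Sys.sleep(pause)  # pause between requests to go easy on the server
    }
  }
  invisible(photo_path(urls, dir))
}

# Usage (not run): download_photos(results$image_url)
```

Records without a photo have NA in image_url and are simply skipped.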