
Identifying why scraping a website only works for some POST request body strings and not others


I am looking to scrape publicly available tables from New York's electricity grid at this url: http://icap.nyiso.com/ucap/public/auc_view_spot_detail.do

I'm able to do so for summer seasons but not for winter seasons. It's not clear to me what I'm missing so I'm hoping someone smarter can chime in.

Below is my process, starting with a screenshot of the page.

[Screenshot: Nyiso inspect]

As circled in red above, one must select a combination of Season & Month and hit Display to generate the tables. I copied the request headers from the browser, along with the URL-encoded payload, which I send as the body of the POST request.

# libraries
library(jsonlite) 
library(lubridate)
library(data.table)
library(httr)
library(rvest)

# get session and cookies
initial_url <- "http://icap.nyiso.com/ucap/public/auc_view_spot_detail.do"
initial_response <- GET(initial_url)
cookie_data <- cookies(initial_response)
cookie_string <- paste0(cookie_data$name, "=", cookie_data$value, collapse = "; ")


# Define the POST request headers, including cookies
headers <- c(
  "Accept" = "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
  "Accept-Encoding" = "gzip, deflate",
  "Accept-Language" = "en-US,en;q=0.9",
  "Cache-Control" = "max-age=0",
  "Connection" = "keep-alive",
  "Content-Length" = "85",
  "Content-Type" = "application/x-www-form-urlencoded",
  "Cookie" = cookie_string,
  "Host" = "icap.nyiso.com",
  "Origin" = "null",
  "Upgrade-Insecure-Requests" = "1",
  "User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36"
)

# Define the URL for the POST request
post_url <- "http://icap.nyiso.com/ucap/public/auc_view_spot_detail.do"

# Below is working code for a "Summer" season:
response <- POST(post_url, add_headers(.headers = headers), encode = "form", 
                 body = "seasonId=702793&seasonId=Summer+2024&month=05%2F2024&month=May%2F2024&display=Display")
html_content <- content(response, as = "text")
html <- read_html(html_content)
tables <- html %>% html_nodes("table")
html_table(tables[4]) # print
#[[1]]
## A tibble: 45 × 2
#   X1                           X2            
#   <chr>                        <chr>         
# 1 "05/2024"                    "05/2024"     
# 2 "G-J Locality"               "G-J Locality"
# 3 "Awarded Deficiency (MW)"    "1,888.1"     
# 4 "Awarded Excess (MW)"        "1,694.400"   
# 5 "% Excess Above Requirement" "14.78"       
# 6 "Price ($/kW-M)"             "$4.27"       
# 7 ""                           ""            
# 8 "LI"                         "LI"          
# 9 "Awarded Deficiency (MW)"    "176.9"       
#10 "Awarded Excess (MW)"        "519.200"     
## ℹ 35 more rows
## ℹ Use `print(n = ...)` to see more rows

Strangely enough, this process does not work when I change the body to the one shown for a winter season when inspecting the network. Any idea what I might be missing?

[Screenshot: nyiso inspect 2]

# does not work to generate the data
response <- POST(post_url, add_headers(.headers = headers), encode = "form", 
                 body = "seasonId=702409&seasonId=Winter+2023-2024&month=02%2F2024&month=Feb%2F2024&display=Display")
html_content <- content(response, as = "text")
html <- read_html(html_content)
tables <- html %>% html_nodes("table")
html_table(tables[4]) # there is no such table
   

A few odd behaviors I've noticed:

  • There are duplicate body parameters, but the request does not work if I remove any one of them.
  • You can change the seasonId number (seasonId=702793) to any existing ID as long as the string (seasonId=Summer+2024) is correct. (Other IDs are listed at http://icap.nyiso.com/ucap/rest/seasons/public.)

I could not locate a public REST API for the actual data in the table either.

Thanks for your time and thoughts.

And here are a bunch of body strings I've used to determine that this is an issue only for winter seasons:

body_strings <- c("seasonId=700085&seasonId=Winter+2021-2022&month=01%2F2022&month=Jan%2F2022&display=Display", 
"seasonId=700085&seasonId=Winter+2021-2022&month=02%2F2022&month=Feb%2F2022&display=Display", 
"seasonId=700085&seasonId=Winter+2021-2022&month=03%2F2022&month=Mar%2F2022&display=Display", 
"seasonId=700085&seasonId=Winter+2021-2022&month=04%2F2022&month=Apr%2F2022&display=Display", 
"seasonId=700490&seasonId=Summer+2022&month=05%2F2022&month=May%2F2022&display=Display", 
"seasonId=700490&seasonId=Summer+2022&month=06%2F2022&month=Jun%2F2022&display=Display", 
"seasonId=700490&seasonId=Summer+2022&month=07%2F2022&month=Jul%2F2022&display=Display", 
"seasonId=700490&seasonId=Summer+2022&month=08%2F2022&month=Aug%2F2022&display=Display", 
"seasonId=700490&seasonId=Summer+2022&month=09%2F2022&month=Sep%2F2022&display=Display", 
"seasonId=700490&seasonId=Summer+2022&month=10%2F2022&month=Oct%2F2022&display=Display", 
"seasonId=700882&seasonId=Winter+2022-2023&month=11%2F2022&month=Nov%2F2022&display=Display", 
"seasonId=700882&seasonId=Winter+2022-2023&month=12%2F2022&month=Dec%2F2022&display=Display", 
"seasonId=700882&seasonId=Winter+2022-2023&month=01%2F2023&month=Jan%2F2023&display=Display", 
"seasonId=700882&seasonId=Winter+2022-2023&month=02%2F2023&month=Feb%2F2023&display=Display", 
"seasonId=700882&seasonId=Winter+2022-2023&month=03%2F2023&month=Mar%2F2023&display=Display", 
"seasonId=700882&seasonId=Winter+2022-2023&month=04%2F2023&month=Apr%2F2023&display=Display", 
"seasonId=701280&seasonId=Summer+2023&month=05%2F2023&month=May%2F2023&display=Display", 
"seasonId=701280&seasonId=Summer+2023&month=06%2F2023&month=Jun%2F2023&display=Display", 
"seasonId=701280&seasonId=Summer+2023&month=07%2F2023&month=Jul%2F2023&display=Display", 
"seasonId=701280&seasonId=Summer+2023&month=08%2F2023&month=Aug%2F2023&display=Display", 
"seasonId=701280&seasonId=Summer+2023&month=09%2F2023&month=Sep%2F2023&display=Display", 
"seasonId=701280&seasonId=Summer+2023&month=10%2F2023&month=Oct%2F2023&display=Display", 
"seasonId=702409&seasonId=Winter+2023-2024&month=11%2F2023&month=Nov%2F2023&display=Display", 
"seasonId=702409&seasonId=Winter+2023-2024&month=12%2F2023&month=Dec%2F2023&display=Display", 
"seasonId=702409&seasonId=Winter+2023-2024&month=01%2F2024&month=Jan%2F2024&display=Display", 
"seasonId=702409&seasonId=Winter+2023-2024&month=02%2F2024&month=Feb%2F2024&display=Display", 
"seasonId=702409&seasonId=Winter+2023-2024&month=03%2F2024&month=Mar%2F2024&display=Display", 
"seasonId=702409&seasonId=Winter+2023-2024&month=04%2F2024&month=Apr%2F2024&display=Display", 
"seasonId=702793&seasonId=Summer+2024&month=05%2F2024&month=May%2F2024&display=Display", 
"seasonId=702793&seasonId=Summer+2024&month=06%2F2024&month=Jun%2F2024&display=Display", 
"seasonId=702793&seasonId=Summer+2024&month=07%2F2024&month=Jul%2F2024&display=Display", 
"seasonId=702793&seasonId=Summer+2024&month=08%2F2024&month=Aug%2F2024&display=Display"
)
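
The body strings above follow a fixed pattern, so they could also be generated rather than typed out. The following is a minimal sketch in base R; `build_body` is a hypothetical helper name (not part of httr or rvest), and it assumes the season ID is passed as a character string and that the season label and months are already known to belong together.

```r
# Hypothetical helper: build one URL-encoded body per month of a season.
# Spaces in the season label become "+", and "/" is written as "%2F".
build_body <- function(season_id, season_label, month, year) {
  sprintf("seasonId=%s&seasonId=%s&month=%02d%%2F%d&month=%s%%2F%d&display=Display",
          season_id,
          gsub(" ", "+", season_label, fixed = TRUE),
          month, year,
          month.abb[month],  # English abbreviations (Jan, Feb, ...), locale-independent
          year)
}

# Winter 2023-2024 runs from Nov 2023 through Apr 2024
winter_bodies <- build_body("702409", "Winter 2023-2024",
                            month = c(11, 12, 1, 2, 3, 4),
                            year  = c(2023, 2023, 2024, 2024, 2024, 2024))
print(winter_bodies)
```

Because `sprintf` is vectorized, one call returns all six winter bodies, in the same form as the entries listed above.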

Solution

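  • Your problem is that you are specifying a content length in the header which you are not then honoring in your content string. "Content-Length" = "85" matches the summer body exactly, but "Winter+2023-2024" is five characters longer than "Summer+2024", so the winter body is 90 bytes and the declared length no longer matches what you send.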

    Part of the problem here is that you are over-specifying the request, which makes it harder to debug. You don't need the initial GET request, or cookies, or user agent, or most of the other headers.

    The following is fully reproducible in a clean session:

    library(httr)
    library(rvest)
    
    headers <- c(`Connection` = "keep-alive",
                 `Content-Type`  = "application/x-www-form-urlencoded",
                 `Upgrade-Insecure-Requests` = "1")
    
    POST("http://icap.nyiso.com/ucap/public/auc_view_spot_detail.do", 
         body = paste0("seasonId=702409",
                       "&seasonId=Winter+2023-2024",
                       "&month=02%2F2024",
                       "&month=Feb%2F2024",
                       "&display=Display"), 
         add_headers(.headers = headers)) %>%
      content(as = "text") %>%
      read_html() %>% 
      html_nodes("table") %>%
      getElement(4) %>%
      html_table()
    #> # A tibble: 45 x 2
    #>    X1                           X2            
    #>    <chr>                        <chr>         
    #>  1 "02/2024"                    "02/2024"     
    #>  2 "G-J Locality"               "G-J Locality"
    #>  3 "Awarded Deficiency (MW)"    "2,620.8"     
    #>  4 "Awarded Excess (MW)"        "1,748.600"   
    #>  5 "% Excess Above Requirement" "14.16"       
    #>  6 "Price ($/kW-M)"             "$4.56"       
    #>  7 ""                           ""            
    #>  8 "LI"                         "LI"          
    #>  9 "Awarded Deficiency (MW)"    "42.8"        
    #> 10 "Awarded Excess (MW)"        "859.700"     
    #> # i 35 more rows
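
To make the mismatch concrete, the byte lengths of the two bodies from the question can be compared against the hard-coded "Content-Length" of 85. This is a minimal sketch: the length check uses only base R, while `fetch_spot_table` is a hypothetical helper name (not part of httr or rvest) that wraps the request shown above and leaves the content length for the HTTP client to compute. It assumes the table of interest is always the fourth one on the page, as in the examples above.

```r
summer_body <- "seasonId=702793&seasonId=Summer+2024&month=05%2F2024&month=May%2F2024&display=Display"
winter_body <- "seasonId=702409&seasonId=Winter+2023-2024&month=02%2F2024&month=Feb%2F2024&display=Display"

# Byte length of each body; the question's header declared "85" for both
body_lengths <- c(summer = nchar(summer_body, type = "bytes"),
                  winter = nchar(winter_body, type = "bytes"))
print(body_lengths)

# Hypothetical helper: POST one body without a manual Content-Length header
# and return the fourth table on the page (assumed to be the one of interest).
fetch_spot_table <- function(body,
                             url = "http://icap.nyiso.com/ucap/public/auc_view_spot_detail.do") {
  headers <- c(`Connection` = "keep-alive",
               `Content-Type` = "application/x-www-form-urlencoded",
               `Upgrade-Insecure-Requests` = "1")
  response <- httr::POST(url, body = body, httr::add_headers(.headers = headers))
  httr::stop_for_status(response)  # fail early on HTTP errors
  page <- rvest::read_html(httr::content(response, as = "text", encoding = "UTF-8"))
  tables <- rvest::html_nodes(page, "table")
  if (length(tables) < 4) stop("expected at least 4 tables, found ", length(tables))
  rvest::html_table(tables[[4]])
}
```

With `body_strings` from the question, `lapply(body_strings, fetch_spot_table)` should then return one table per month, although only the February 2024 request has actually been shown to work here.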