html css r web-scraping rvest

How to web scrape table element using rvest?


I am looking to scrape data from this carrier link. I am using the rvest package in R, and I've scraped some of the top information on the webpage with the code below:

library(rvest)

url <- "https://www.aaacooper.com/pwb/Transit/ProTrackResults.aspx?ProNum=241939875&AllAccounts=true"
page <- read_html(url)

# Extract the table on the page
table <- page %>% html_nodes("table") %>% .[[2]] %>% html_table()

# Print the table
View(table)

Which yields this information: pic1

However, I am looking to retrieve the information from the Tracing Information table in a tabular format instead: pic2


Solution

  • Here's a mundane method:

    library(rvest)
    sess <- session("https://www.aaacooper.com/pwb/Transit/ProTrackResults.aspx?ProNum=241939875&AllAccounts=true")
    html_table(sess)[[9]]
    # # A tibble: 10 × 3
    #    Date       Time  Description                                               
    #    <chr>      <chr> <chr>                                                     
    #  1 2022-06-24 13:02 Delivered To Consignee In BRADENTON, FL                   
    #  2 2022-06-24 04:22 Shipment arrived at destination Service Center   TAMPA, FL
    #  3 2022-06-24 03:02 Shipment departed ORLANDO Service Center                  
    #  4 2022-06-23 06:34 Shipment arrived at ORLANDO Service Center                
    #  5 2022-06-22 22:54 Shipment departed DOTHAN Service Center                   
    #  6 2022-06-21 22:52 Shipment arrived at DOTHAN Service Center                 
    #  7 2022-06-21 10:36 Shipment departed HOUSTON Service Center                  
    #  8 2022-06-21 03:15 Shipment arrived at HOUSTON Service Center                
    #  9 2022-06-20 19:59 Shipment departed WESLACO Service Center                  
    # 10 2022-06-20 12:21 Shipment Picked Up From Shipper In WESLACO, TX            
    

    The use of [[9]] was based on looking at all the tables returned by html_table(); nothing guarantees that number will persist.
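
One way to pick the index without guessing is to list the column names of every table that html_table() returns. The sketch below is only an illustration: it uses a small inline page built with minimal_html() as a hypothetical stand-in for the carrier page (which has many more tables), so it runs without network access.

```r
library(rvest)

# Hypothetical stand-in page with two tables; the real page has more.
page <- minimal_html('
  <table>
    <tr><th>Pro Number</th><th>Status</th></tr>
    <tr><td>241939875</td><td>Delivered</td></tr>
  </table>
  <table>
    <tr><th>Date</th><th>Time</th><th>Description</th></tr>
    <tr><td>2022-06-24</td><td>13:02</td><td>Delivered To Consignee In BRADENTON, FL</td></tr>
  </table>
')

# html_table() on a whole document returns a list, one tibble per <table>.
tables <- html_table(page)

# Show each table's position next to its column names.
headers <- lapply(tables, names)
for (i in seq_along(headers)) {
  cat(i, ":", paste(headers[[i]], collapse = ", "), "\n")
}
```

On the real page the same loop over html_table(sess) should show which position holds the Date/Time/Description columns, though that position can still change if the site is redesigned.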

    A better method of finding a table is by looking for specific attributes/headers/names/ids, best found using the SelectorGadget.
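
As a minimal sketch of the "look for specific headers" idea, the table can be chosen by its column names rather than by position. The helper name find_table_by_header is hypothetical (it is not part of rvest), and the inline page is an assumed stand-in for the real one.

```r
library(rvest)

# Hypothetical helper (not part of rvest): return the first table whose
# column names are exactly `wanted`, or stop if there is none.
find_table_by_header <- function(page, wanted) {
  tables <- html_table(page)
  is_match <- vapply(tables, function(tb) identical(names(tb), wanted), logical(1))
  if (!any(is_match)) stop("No table with the expected header was found.")
  tables[[which(is_match)[1]]]
}

# Assumed stand-in for the carrier page.
page <- minimal_html('
  <table>
    <tr><th>Pro Number</th><th>Status</th></tr>
    <tr><td>241939875</td><td>Delivered</td></tr>
  </table>
  <table>
    <tr><th>Date</th><th>Time</th><th>Description</th></tr>
    <tr><td>2022-06-24</td><td>13:02</td><td>Delivered To Consignee In BRADENTON, FL</td></tr>
    <tr><td>2022-06-24</td><td>04:22</td><td>Shipment arrived at destination Service Center TAMPA, FL</td></tr>
  </table>
')

tracing <- find_table_by_header(page, c("Date", "Time", "Description"))
```

This keeps working if tables are added or reordered, but it breaks if the site renames a column, so it trades one fragility for another.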

    A slightly more detailed look at the URL page reveals that the parent node of that table has class="tracingInformation", indicating we can do this:

    html_element(sess, ".tracingInformation") %>%
      html_children() %>%
      html_table()
    # [[1]]
    # # A tibble: 10 × 3
    #    Date       Time  Description                                               
    #    <chr>      <chr> <chr>                                                     
    #  1 2022-06-24 13:02 Delivered To Consignee In BRADENTON, FL                   
    #  2 2022-06-24 04:22 Shipment arrived at destination Service Center   TAMPA, FL
    #  3 2022-06-24 03:02 Shipment departed ORLANDO Service Center                  
    #  4 2022-06-23 06:34 Shipment arrived at ORLANDO Service Center                
    #  5 2022-06-22 22:54 Shipment departed DOTHAN Service Center                   
    #  6 2022-06-21 22:52 Shipment arrived at DOTHAN Service Center                 
    #  7 2022-06-21 10:36 Shipment departed HOUSTON Service Center                  
    #  8 2022-06-21 03:15 Shipment arrived at HOUSTON Service Center                
    #  9 2022-06-20 19:59 Shipment departed WESLACO Service Center                  
    # 10 2022-06-20 12:21 Shipment Picked Up From Shipper In WESLACO, TX            
    

    Here is how I found that. I'm using Firefox, but I'm confident other browsers have the same or very similar keys, tabs, and names.

    1. Open that url in a browser.
    2. Once loaded, hit F12 (or whatever key enters the browser's dev console).
    3. Select "Pick an element" and click a cell in the table you want. (In Firefox, this is a small button to the left of "Inspector".)
    4. Find the first reference to <table> above the cell. If this doesn't have an unambiguous id= or class= (as in this example, I thought id="AAACooperMasterPage_bodyContent_grdViewTraceInfo" was a bit obscure/automated), go up a little higher until you find a clear id= or class=. In this case, I found that the table we want is encased in another table with class="tracingInformation".
    5. Use that in html_element(..).
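
If the id found in step 4 is acceptable despite looking auto-generated, it can be used directly with a "#" selector, and a descendant selector on the class avoids the html_children() step. The sketch below assumes a simplified structure (a div wrapper around the table) as a stand-in; the real page's nesting may differ.

```r
library(rvest)

# Simplified, assumed stand-in for the relevant part of the page.
page <- minimal_html('
  <div class="tracingInformation">
    <table id="AAACooperMasterPage_bodyContent_grdViewTraceInfo">
      <tr><th>Date</th><th>Time</th><th>Description</th></tr>
      <tr><td>2022-06-24</td><td>13:02</td><td>Delivered To Consignee In BRADENTON, FL</td></tr>
    </table>
  </div>
')

# Select by id: "#" followed by the id attribute value.
by_id <- page %>%
  html_element("#AAACooperMasterPage_bodyContent_grdViewTraceInfo") %>%
  html_table()

# Select by class: any <table> nested inside the element with that class.
by_class <- page %>%
  html_element(".tracingInformation table") %>%
  html_table()
```

Because html_element() returns a single node, html_table() gives back one tibble here rather than a list of length one.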