I am looking to scrape data from this carrier link. I am using the rvest package in R, and I've scraped some of the top information on the webpage using the code below:
library(rvest)
url <- "https://www.aaacooper.com/pwb/Transit/ProTrackResults.aspx?ProNum=241939875&AllAccounts=true"
page <- read_html(url)
# Extract the table on the page
table <- page %>% html_nodes("table") %>% .[[2]] %>% html_table()
# Print the table
View(table)
Which yields this information:
However, I am looking to retrieve the information from the Tracing Information table in a tabular format instead:
Here's a mundane method:
library(rvest)
sess <- session("https://www.aaacooper.com/pwb/Transit/ProTrackResults.aspx?ProNum=241939875&AllAccounts=true")
html_table(sess)[[9]]
# # A tibble: 10 × 3
# Date Time Description
# <chr> <chr> <chr>
# 1 2022-06-24 13:02 Delivered To Consignee In BRADENTON, FL
# 2 2022-06-24 04:22 Shipment arrived at destination Service Center TAMPA, FL
# 3 2022-06-24 03:02 Shipment departed ORLANDO Service Center
# 4 2022-06-23 06:34 Shipment arrived at ORLANDO Service Center
# 5 2022-06-22 22:54 Shipment departed DOTHAN Service Center
# 6 2022-06-21 22:52 Shipment arrived at DOTHAN Service Center
# 7 2022-06-21 10:36 Shipment departed HOUSTON Service Center
# 8 2022-06-21 03:15 Shipment arrived at HOUSTON Service Center
# 9 2022-06-20 19:59 Shipment departed WESLACO Service Center
# 10 2022-06-20 12:21 Shipment Picked Up From Shipper In WESLACO, TX
The use of [[9]] was based on looking at all tables returned by html_table(); there's nothing guaranteeing that number will persist.
A better method of finding a table is to look for specific attributes/headers/names/ids, best found using SelectorGadget.
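As a rough sketch of the header-based idea, the snippet below picks the table whose column names include Date, Time and Description instead of relying on a position such as [[9]]. The HTML is a small hypothetical stand-in built with minimal_html() so that it runs offline; the first table and the names wanted, has_headers and tracing are my own assumptions, not something taken from the real page.

```r
library(rvest)

# Hypothetical stand-in for the real page: two tables, only one of which
# carries the Date/Time/Description headers.
page <- minimal_html('
  <table>
    <tr><th>Pro Number</th><th>Status</th></tr>
    <tr><td>241939875</td><td>Delivered</td></tr>
  </table>
  <table>
    <tr><th>Date</th><th>Time</th><th>Description</th></tr>
    <tr><td>2022-06-24</td><td>13:02</td><td>Delivered To Consignee In BRADENTON, FL</td></tr>
    <tr><td>2022-06-20</td><td>12:21</td><td>Shipment Picked Up From Shipper In WESLACO, TX</td></tr>
  </table>
')

# html_table() on a whole document returns one tibble per <table>.
tables <- html_table(page)

# Keep the first table that has all of the headers we expect.
wanted <- c("Date", "Time", "Description")
has_headers <- vapply(tables, function(tb) all(wanted %in% names(tb)), logical(1))
tracing <- tables[[which(has_headers)[1]]]
tracing
```

With a live page, the same vapply() check should work on html_table(sess), assuming the headers keep those names.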
A slightly more detailed look at the page reveals that the parent node of that table has class="tracingInformation", indicating we can do this:
html_element(sess, ".tracingInformation") %>%
html_children() %>%
html_table()
# [[1]]
# # A tibble: 10 × 3
# Date Time Description
# <chr> <chr> <chr>
# 1 2022-06-24 13:02 Delivered To Consignee In BRADENTON, FL
# 2 2022-06-24 04:22 Shipment arrived at destination Service Center TAMPA, FL
# 3 2022-06-24 03:02 Shipment departed ORLANDO Service Center
# 4 2022-06-23 06:34 Shipment arrived at ORLANDO Service Center
# 5 2022-06-22 22:54 Shipment departed DOTHAN Service Center
# 6 2022-06-21 22:52 Shipment arrived at DOTHAN Service Center
# 7 2022-06-21 10:36 Shipment departed HOUSTON Service Center
# 8 2022-06-21 03:15 Shipment arrived at HOUSTON Service Center
# 9 2022-06-20 19:59 Shipment departed WESLACO Service Center
# 10 2022-06-20 12:21 Shipment Picked Up From Shipper In WESLACO, TX
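Note that the result above is a one-element list (hence the [[1]]), because html_children() returns a set of nodes. If a plain tibble is preferred, one option is a descendant selector that targets the table directly. This is only a sketch on hypothetical stand-in markup: here the wrapper is a div for simplicity, whereas on the real page the parent is reportedly another table, so treat the selector as an assumption to verify.

```r
library(rvest)

# Hypothetical stand-in: a wrapper with class "tracingInformation"
# around the table we want.
page <- minimal_html('
  <div class="tracingInformation">
    <table>
      <tr><th>Date</th><th>Time</th><th>Description</th></tr>
      <tr><td>2022-06-24</td><td>13:02</td><td>Delivered To Consignee In BRADENTON, FL</td></tr>
      <tr><td>2022-06-24</td><td>04:22</td><td>Shipment arrived at destination Service Center TAMPA, FL</td></tr>
    </table>
  </div>
')

# ".tracingInformation table" matches a <table> nested anywhere inside
# the wrapper; html_element() returns the first match, so html_table()
# gives a single tibble rather than a list.
tracing <- page %>%
  html_element(".tracingInformation table") %>%
  html_table()
tracing
```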
Here is the walkthrough of how I found that. I'm using Firefox; I'm confident other browsers have the same or very similar keys/tabs/names.
1. Press F12 (or whatever key enters the browser's dev console).
2. Locate a cell of the table you want in the inspector, then look at the <table> above the cell. If this doesn't have an unambiguous id= or class= (as in this example, I thought id="AAACooperMasterPage_bodyContent_grdViewTraceInfo" was a bit obscure/automated), go up a little higher until you find a clear id= or class=. In this case, I found that the table we want is encased in another table with class="tracingInformation".
3. Use that in html_element(..).
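The same inspection can be approximated from R by listing the id and class attributes of every table, which may help when deciding which selector to trust. The sketch below again uses hypothetical stand-in markup (the "summary" class and the div wrapper are assumptions); only the id and the tracingInformation class come from the page described above.

```r
library(rvest)

# Hypothetical stand-in markup; the id and the tracingInformation class
# are the ones mentioned above, everything else is made up.
page <- minimal_html('
  <table class="summary">
    <tr><th>Pro Number</th></tr>
    <tr><td>241939875</td></tr>
  </table>
  <div class="tracingInformation">
    <table id="AAACooperMasterPage_bodyContent_grdViewTraceInfo">
      <tr><th>Date</th><th>Time</th><th>Description</th></tr>
      <tr><td>2022-06-24</td><td>13:02</td><td>Delivered To Consignee In BRADENTON, FL</td></tr>
    </table>
  </div>
')

# List the id/class of each <table>; html_attr() gives NA when an
# attribute is absent.
all_tables <- html_elements(page, "table")
table_info <- data.frame(
  id    = html_attr(all_tables, "id"),
  class = html_attr(all_tables, "class")
)
table_info

# Selecting by id also works, though an auto-generated id may change.
by_id <- page %>%
  html_element("#AAACooperMasterPage_bodyContent_grdViewTraceInfo") %>%
  html_table()
by_id
```

Whether the id or the class is the more stable handle depends on how the site generates its markup, so it is worth re-checking if the scrape starts returning nothing.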