In order to check a list of people against an anti-money-laundering (AML) sanctions list, I tried to web-scrape a site but received:
"Error in check_form(): ! form must be a single form produced by html_form()"
My code is:
#install.packages("robotstxt")
library( robotstxt )
site<-"https://sanctionssearch.ofac.treas.gov/"
paths_allowed(site)
#install.packages("rvest")
library(rvest)
url <- "https://sanctionssearch.ofac.treas.gov/"
search_name <- "John Doe" # Replace with the name you're searching for
# Send a GET request to the URL
page <- read_html(url)
# Find the input field and submit button selectors
input_selector <- "input[name='ctl00$MainContent$txtLastName']"
button_selector <- "input[name='ctl00$MainContent$btnSearch']"
# Fill in the input field with the search name
page <- html_form_set(page, input_selector, value = search_name)
# Submit the search form
page <- html_form_submit(page, button_selector)
# Extract and process the results
results <- page %>%
html_nodes("div.resultName") %>%
html_text()
# Check if the search name appears in the results
name_found <- search_name %in% results
cat("Search name found:", name_found, "\n")
There are a couple of issues with the code you've written above, most of which can be resolved by carefully reading the examples rvest provides for these functions. First, we need to actually extract the web form from the HTML document returned by read_html() using html_form(), and it looks like we want the first one. Second, we need to set the form fields using key-value pairs rather than the CSS selector strings you've pulled out above. Finally, submitting the form doesn't require the full selector for the button, just the name of the form element associated with it (ctl00$MainContent$btnSearch, although here the default of NULL works just as well because it automatically chooses the first button). That all simplifies down to the code below:
page_form <- html_form(read_html("https://sanctionssearch.ofac.treas.gov/"))[[1]]
page_results <- page_form %>%
html_form_set(`ctl00$MainContent$txtLastName`="Johnson") %>%
html_form_submit(submit = "ctl00$MainContent$btnSearch")
Then, once we've got the response itself, we need to extract its content (the HTML) because html_nodes() doesn't work on a response object (and has in any case been deprecated in favor of html_elements()). Finally, rather than processing the page manually using the div tag you've provided, we can extract the tables directly with html_table() and find the one we're interested in. A last name of "Doe" didn't return any results for me, so I'm using "Johnson" as an example instead:
read_html(page_results) %>% html_table()
[[6]]
# A tibble: 3 x 6
X1 X2 X3 X4 X5 X6
<chr> <chr> <chr> <chr> <chr> <int>
1 JOHNSON, Prince Nimba County Individual GLOMAG SDN 100
2 JOHNSON, Prince Y. Nimba County Individual GLOMAG SDN 100
3 JOHNSON, Prince Yormie Nimba County Individual GLOMAG SDN 100
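For your original goal of flagging whether a searched name appears in the results, a case-insensitive pattern match over the name column is more robust than the exact %in% comparison in your code (the results are uppercase full names, so "John Doe" would never match exactly). A minimal sketch, using a stand-in data frame shaped like the table above — with live data you'd index into html_table()'s output instead:

```r
# Stand-in for the relevant table returned by html_table();
# only the name column matters for this check.
results <- data.frame(
  name = c("JOHNSON, Prince",
           "JOHNSON, Prince Y.",
           "JOHNSON, Prince Yormie"),
  stringsAsFactors = FALSE
)

search_name <- "Johnson"

# TRUE if the search term appears anywhere in any result name,
# ignoring case (OFAC results come back in uppercase).
name_found <- any(grepl(search_name, results$name, ignore.case = TRUE))
cat("Search name found:", name_found, "\n")
```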
And as a final note, you may have better luck downloading and parsing the entire dataset rather than using the web lookup tool, unless you really need the absolute most up-to-date information — which you shouldn't be querying this way anyway, because the docs say: "It should not be utilized by automated systems that are configured to continually run searches through the tool."
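For the bulk-download route, a sketch of how the local search could look. The helper and column name here are hypothetical, and the commented-out CSV URL is an assumption — check OFAC's "Sanctions List Data" page for the current download link before relying on it:

```r
# Hypothetical helper: filter a data frame of SDN entries by surname,
# case-insensitively.
search_sdn <- function(entries, surname) {
  entries[grepl(surname, entries$name, ignore.case = TRUE), , drop = FALSE]
}

# With the real data, you'd download the full SDN list once, e.g.:
# sdn <- read.csv("https://www.treasury.gov/ofac/downloads/sdn.csv",
#                 header = FALSE, stringsAsFactors = FALSE)
# names(sdn)[2] <- "name"  # assumed: second column holds the entity name

# Offline demonstration with sample rows:
sample_sdn <- data.frame(
  name = c("JOHNSON, Prince Yormie", "DOE, Jane"),
  stringsAsFactors = FALSE
)
hits <- search_sdn(sample_sdn, "Johnson")
nrow(hits)  # 1
```

This way you run one download per refresh of the list and do all the per-name matching locally, which is both faster for a long list of people and compliant with the tool's usage note.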