Tags: r, web-scraping, html-table, rselenium

R: Capturing Strike-Through Text Using RSelenium


I am using RSelenium to scrape data tables from a website, iterating through many pages in a loop.

The code below successfully scrapes the table in question (although it loses UTF-8 formatting). However, some entries in the table are struck through, and in those cases the code ignores the strike-through and acts as if it were not there.

Example:

[image: a table entry shown with strike-through on the website] but it is recorded in R as [image: the same entry without the strike-through]

Could anyone please help with how I can retain the strike-through information when I scrape the table?

My code for scraping the table:

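library(rvest)  # provides read_html(), html_table() and the %>% pipe

# remDr is an already-open RSelenium remote driver
Data_table_html <- remDr$getPageSource()[[1]] %>%
  read_html() %>%
  html_table(header = FALSE, fill = TRUE)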

I have spent hours on this, so any help or pointers would be immensely helpful.


Solution

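  • I would like to share the solution I found, below. In short, selecting the relevant HTML nodes and reading their "style" attribute with html_attr() does the trick, because struck-through entries carry text-decoration:line-through in that attribute: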

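    saving <- read_html(remDr$getPageSource()[[1]]) %>%   # parse the page source before selecting nodes
      html_nodes(xpath = 'your xpath') %>%
      html_attr("style") %>%                              # NA for nodes without a style attribute
      gsub("text-decoration:line-through;", "0", .)       # replace the line-through declaration with "0"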
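
To keep the cell text and the strike-through information side by side, something along the following lines may help. This is only a sketch under assumptions: the inline `page_source` string is a made-up stand-in for `remDr$getPageSource()[[1]]`, and it assumes the strike-through is expressed either through an inline `style` attribute containing `line-through` or through an `<s>`, `<del>` or `<strike>` element. A site that applies the strike-through via a CSS class would need a different test.

```r
library(rvest)

# Hypothetical page source; in practice this would be remDr$getPageSource()[[1]]
page_source <- '
<table>
  <tr><td>Alpha</td><td>10</td></tr>
  <tr><td><span style="text-decoration:line-through;">Beta</span></td><td>20</td></tr>
  <tr><td><s>Gamma</s></td><td>30</td></tr>
</table>'

# Select every table cell (replace "td" with your own selector or xpath)
cells <- read_html(page_source) %>% html_nodes("td")

# XPath matching strike-through markup in a cell or any of its descendants
strike_xpath <- paste(
  ".//s", ".//del", ".//strike",
  "descendant-or-self::*[contains(@style, 'line-through')]",
  sep = " | "
)

# TRUE for each cell that contains at least one matching node
is_struck <- vapply(seq_along(cells), function(i) {
  length(html_nodes(cells[[i]], xpath = strike_xpath)) > 0
}, logical(1))

# One row per cell: its text plus a strike-through flag
result <- data.frame(
  text = html_text(cells, trim = TRUE),
  struck = is_struck,
  stringsAsFactors = FALSE
)
print(result)
```

Matching on `contains(@style, 'line-through')` rather than the exact string `text-decoration:line-through;` also catches styles written with extra spaces or additional declarations.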