How to extract all visible text from a webpage using R


I require the visible text of this page: https://www.americanexpress.com/ca/en/credit-cards/simply-cash-preferred/

At first, I thought RSelenium would work. But I couldn't figure out how to get all of the text that is visible.

library("RSelenium")
library("rvest")

remDr <- remoteDriver(port = 4445L)
remDr$open()

remDr$navigate("https://www.americanexpress.com/ca/en/credit-cards/simply-cash-preferred")
remDr$getPageSource(header = TRUE)[[1]] %>% read_html()
# or
remDr$findElement(using='css selector',"body")$getElementText()

Next, I read about getURLContent:

library("RCurl")
library("XML")

url <- "https://www.americanexpress.com/ca/en/credit-cards/simply-cash-preferred"
x <- getURLContent(url)
x

but I received this message when I ran it:

[1] "Found. Redirecting to /ca/en/credit-cards/simply-cash-preferred/"
attr(,"Content-Type")
                  charset 
"text/plain"      "utf-8" 

I'm not sure how to obtain the content of this particular page using getURLContent.


Solution

  • As the page relies heavily on JavaScript, a combination of RSelenium, rvest, and htm2txt is helpful. The function htm2txt::htm2txt() strips out much of the JavaScript and formatting markup that would be difficult to exclude using plain rvest.

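    library(RSelenium)
    library(rvest)
    library(htm2txt)
    library(tidyverse)
    
    # Start a Selenium server with a Firefox browser and get the client handle
    rD <- rsDriver(browser = "firefox", port = 4545L, verbose = TRUE)
    remDr <- rD[["client"]]
    
    # Load the page in the browser so that its JavaScript runs
    remDr$navigate("https://www.americanexpress.com/ca/en/credit-cards/simply-cash-preferred")
    
    # Take the rendered source, keep only <body>, and convert that HTML to plain text
    captured_text <- 
      remDr$getPageSource()[[1]] %>% 
      read_html(encoding = "UTF-8") %>% 
      html_node(xpath = "//body") %>% 
      as.character() %>% 
      htm2txt::htm2txt()
    
    # Close the browser and stop the Selenium server once the text is captured
    remDr$close()
    rD[["server"]]$stop()
    
    captured_text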
    [1] "Skip to content\n\nMenuMenu\n\nThe following navigation element is controlled via arrow keys followed by tab\n\nMy Account\nMy Account\n\nPersonal Accounts\n\n• Account Summary\n\n• View Statement\n\n• Manage Account\n\n• Make a Payment\n\n• Manage Pre-Authorized Payment\n\n• Add Someone to Your Account\n\nBusiness Accounts\n\n• Business Account Summary\n\n• American Express @Work\n\n• Merchant Services\n\nOnline Services\n\n• Register for Online Services\n\n• Activate Your Card\n\n• American Express App\n\n• Manage Account Alerts\n\n• Sign Up for Email Offers\n\n• Online-Only Statements\n\nHelp & Support\n\n• Forgot User ID or Password?\n\n• Support 24/7\n\n• Welcome Centre\n\n• Ways to Pay\n\n• Security Centre\n\nCanadaChange Country\n\nEnglish\n\n• Français\n\nCards\nCards\n\nPersonal Cards\n\n• View All Cards\n\n• Cash Back Credit Cards\n\n• Flexible Rewards Cards\n\n• No Annual Fee Cards\n\n• Co-Branded Cards\n\n• Travel Cards\n\nFeatured Cards\n\n• The American Express Aeroplan Reserve Card\n\n• The Cobalt Card\n\n• The SimplyCash Preferred Card\n\n• The Choice Card..."
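
The captured text comes back as one long string, with entries separated by newlines. If one entry per element is easier to work with, a minimal base R sketch along the following lines may help. The helper name `split_visible_text` is hypothetical, the sample string is an abridged stand-in for the real `captured_text`, and dropping consecutive duplicates (such as the repeated "My Account") is an optional assumption about what counts as noise.

```r
# Hypothetical helper: turn the single newline-separated string into a
# character vector with one visible text entry per element.
split_visible_text <- function(text, drop_repeats = TRUE) {
  # Split on runs of newlines, trim whitespace, and drop empty entries
  lines <- trimws(unlist(strsplit(text, "\n+")))
  lines <- lines[nzchar(lines)]

  # Optionally drop an entry that merely repeats the one right before it
  if (drop_repeats && length(lines) > 1) {
    is_repeat <- c(FALSE, lines[-1] == lines[-length(lines)])
    lines <- lines[!is_repeat]
  }
  lines
}

# Abridged stand-in for the string returned by htm2txt::htm2txt()
sample_text <- paste0(
  "Skip to content\n\nMenuMenu\n\nMy Account\nMy Account\n\n",
  "Personal Accounts\n\n\u2022 Account Summary\n\n\u2022 View Statement"
)

text_lines <- split_visible_text(sample_text)
print(text_lines)
```

With the real result, the call would be `split_visible_text(captured_text)`. Pass `drop_repeats = FALSE` to keep every entry exactly as it appears on the page.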