
Web scraping in R with Selenium to click new pages


I am trying to page through the results of this dynamic site (https://es.gofundme.com/s?q=covid). From this search page, my intention is to open each project. There are 12 projects per page.

After opening each of these projects and extracting the desired information (assuming I can get it), the script should move on to the next page. That is, once it has processed the 12 projects on page 1, it should process the 12 projects on page 2, and so on.


How can this be done? Any help would be much appreciated. Thanks!

This is my code:

# Loading the required packages
library(rvest)
library(magrittr)  # for the '%>%' pipe operator
library(RSelenium) # to get the loaded html of the dynamic page
library(purrr)     # for 'map_chr'
library(tidyr)     # for extract_numeric()
library(stringr)

df_0<-data.frame(project=character(),
                 name=character(),
                 location=character(),
                 dates=character(),
                 objective=character(),
                 collected=character(),
                 donor=character(),
                 shares=character(),
                 follow=character(),
                 comments=character(),
                 category=character())

#Specifying the url for desired website to be scraped
url <- 'https://es.gofundme.com/f/ayuda-a-ta-josefina-snchez-por-covid-en-pulmn?qid=00dc4567cb859c97b9c3cefd893e1ed9&utm_campaign=p_cp_url&utm_medium=os&utm_source=customer'

# starting local RSelenium (this is the only way to start RSelenium that is working for me atm)
selCommand <- wdman::selenium(jvmargs = c("-Dwebdriver.chrome.verboseLogging=true"), retcommand = TRUE)
shell(selCommand, wait = FALSE, minimized = TRUE)
remDr <- remoteDriver(port = 4567L, browserName = "firefox")
remDr$open()

  

# go to website
remDr$navigate(url)

# get page source and save it as an html object with rvest
html_obj <- remDr$getPageSource(header = TRUE)[[1]] %>% read_html()
  
# 1) Project name
project <- html_obj %>% html_nodes(".a-campaign-title") %>% html_text()
  
 # 2) name 
info <- html_obj %>% html_nodes(".m-person-info") %>% html_text()
  
# 3) location 
location <- html_obj %>% html_nodes(".m-person-info-content") %>% html_text()

  
# 4) dates 
dates <- html_obj %>% html_nodes(".a-created-date") %>% html_text()
  
# 5) Money -collected -objective
money <- html_obj %>% html_nodes(".m-progress-meter-heading") %>% html_text()
  
# 6) doner, shares and followers
popularity <- html_obj %>% html_nodes(".text-stat-value") %>% html_text()
  
# 7) Comments
comments <- html_obj %>% html_nodes(".o-expansion-list-wrapper") %>% html_text()
  
# 8) Category
category <- html_obj %>% html_nodes(".a-link") %>% html_text()
  
  
  
# create the df with all the info
review_data <- data.frame(project=project, 
                            name = gsub("Organizador.*", "", info[7]),
                            location=str_remove(location[7], "Organizador"),
                            dates = dates, 
                            collected = unlist(strsplit(money, " "))[1], 
                            objective = unlist(strsplit(money, " "))[8], 
                            donor = popularity[1],
                            shares = popularity[2],
                            follow = popularity[3],
                            comments = extract_numeric(comments), 
                            category = category[17], 
                            stringsAsFactors = F)  
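
The kind of click-through loop I have in mind looks roughly like the sketch below. The CSS selectors for the project links ("a.a-link") and for the next-page button ("button[aria-label='Next']") are guesses that would need to be checked against the live page:

# go to the search page rather than a single project
remDr$navigate('https://es.gofundme.com/s?q=covid')

all_links <- character()
repeat {
  html_obj <- remDr$getPageSource(header = TRUE)[[1]] %>% read_html()

  # collect the links to the projects listed on the current page
  # (selector is an assumption -- inspect the page to confirm it)
  links <- html_obj %>% html_nodes("a.a-link") %>% html_attr("href")
  all_links <- c(all_links, links)

  # try to click the "next page" button; stop once it is no longer found
  # (this selector is an assumption as well)
  next_btn <- try(remDr$findElement(using = "css selector",
                                    "button[aria-label='Next']"),
                  silent = TRUE)
  if (inherits(next_btn, "try-error")) break
  next_btn$clickElement()
  Sys.sleep(2) # give the next batch of results time to load
}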


Solution

  • The page makes a POST request that you can mimic/simplify. To keep it dynamic, you first need to grab an API key and an application ID from a source js file, then pass those in the subsequent POST request.

    In the following I simply extract the urls from each request. I set the querystring for the POST to return the maximum of 20 results per page. After an initial request, in which I retrieve the number of pages, I map a function across the remaining page numbers, altering the page param and extracting the urls from each POST response.

    You end up with a list of urls for all the projects, which you can then visit to extract info from or, potentially, make xmlhttp requests to.

    N.B. The code could be refactored a little as a tidy-up.

    library(httr)
    library(stringr)
    library(purrr)
    library(tidyverse) # also provides dplyr and tibble used in get_df()
    
    # collect the unique project urls from one page of API hits
    get_df <- function(x){
      df <- map_dfr(x, .f = as_tibble) %>% select(c('url')) %>% unique() %>% 
        mutate( url = paste0('https://es.gofundme.com/f/', url))
      return(df)
    }
    
    # the api key and application id are embedded in this source js file
    r <- httr::GET('https://es.gofundme.com/static/js/main~4f8b914b.bfe3a91b38d67631e0fa.js') %>% content(as='text')
    
    matches <- stringr::str_match_all(r, 't\\.algoliaClient=r\\.default\\("(.*?)","(.*?)"')
    
    application_id <- matches[[1]][,2]
    api_key <-matches[[1]][,3]
    
    headers = c(
      'User-Agent' = 'Mozilla/5.0',
      'content-type' = 'application/x-www-form-urlencoded',
      'Referer' = 'https://es.gofundme.com/'
    )
    
    params = list(
      'x-algolia-agent' = 'Algolia for JavaScript (4.7.0); Browser (lite); JS Helper (3.2.2); react (16.12.0); react-instantsearch (6.8.2)',
      'x-algolia-api-key' = api_key,
      'x-algolia-application-id' = application_id
    )  
    post_body <- '{"requests":[{"indexName":"prod_funds_feed_replica_1","params":"filters=status%3D1%20AND%20custom_complete%3D1&exactOnSingleWordQuery=word&query=covid&hitsPerPage=20&attributesToRetrieve=%5B%22fundname%22%2C%22username%22%2C%22bene_name%22%2C%22objectID%22%2C%22thumb_img_url%22%2C%22url%22%5D&clickAnalytics=true&userToken=00-e940a6572f1b47a7b2338b563aa09b9f-6841178f&page='
    # initial request for page 0, which also reports the total number of pages
    page_num <- 0
    data <- paste0(post_body, page_num, '"}]}')
    res <- httr::POST(url = 'https://e7phe9bb38-dsn.algolia.net/1/indexes/*/queries', httr::add_headers(.headers=headers), query = params, body = data) %>% content()
    num_pages <- res$results[[1]]$nbPages
    df <- get_df(res$results[[1]]$hits)
    pages <- seq_len(num_pages - 1) # pages 1 .. num_pages-1; page 0 is already fetched
    
    # repeat the request for each remaining page and stack the results
    df2 <- map_dfr(pages, function(page_num){
      data <- paste0(post_body, page_num, '"}]}')
      res <- httr::POST('https://e7phe9bb38-dsn.algolia.net/1/indexes/*/queries', httr::add_headers(.headers=headers), query = params, body = data) %>% content()
      get_df(res$results[[1]]$hits)
    })
    
    df <- rbind(df, df2)
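
    From there, each project url in df can be visited to pull the per-project fields. Below is a minimal sketch, reusing the ".a-campaign-title" selector from the question; note that if parts of a project page are rendered client-side, plain read_html() will miss them and you would need RSelenium (or further API calls) instead:

    library(rvest)

    # visit each project page and pull the campaign title;
    # the other fields from the question can be added the same way
    project_info <- map_dfr(df$url, function(u){
      page <- tryCatch(read_html(u), error = function(e) NULL)
      if (is.null(page)) return(tibble(url = u, project = NA_character_))
      tibble(
        url = u,
        project = page %>% html_nodes('.a-campaign-title') %>%
          html_text() %>% paste(collapse = ' ')
      )
    })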