Scrape Anchored Website with Selenium Package in R


I am fairly new to R and am having trouble with pulling data from the Forbes website.

My current code is:

    library(XML)  # provides readHTMLTable

    url = "http://www.forbes.com/global2000/list/#page:1_sort:0_direction:asc_search:_filter:All%20industries_filter:All%20countries_filter:All%20states"
    data = readHTMLTable(url)

However, the Forbes URL is anchored: the page and filter settings sit after the "#" symbol in the link. I installed the RSelenium package to get at the data I want, but I am not well versed in RSelenium.
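
For context, everything after the "#" is a URL fragment. A browser keeps the fragment to itself and does not send it to the server, so a plain HTTP fetch receives the same document whatever page number the link names. A minimal base-R sketch that separates the two parts (the variable names are illustrative, and the remark about JavaScript is an assumption about how the site works):

```r
# The URL from the question: everything after "#" is the fragment.
url <- "http://www.forbes.com/global2000/list/#page:1_sort:0_direction:asc_search:_filter:All%20industries_filter:All%20countries_filter:All%20states"

# regexpr() returns the position of the first "#" in the string.
hash_pos <- regexpr("#", url, fixed = TRUE)

# The part before "#" is all that an HTTP request carries to the server.
base_url <- substr(url, 1, hash_pos - 1)

# The part after "#" stays in the browser (assumed here to be read by the
# page's own JavaScript to decide which rows to show).
fragment <- substr(url, hash_pos + 1, nchar(url))

# The page number lives only in the fragment.
page <- sub("^page:([0-9]+)_.*$", "\\1", fragment)
```

Printing `base_url` shows the address that readHTMLTable effectively requests, which is why changing `page:1` to `page:2` in the link makes no difference to it.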

Does anyone have advice on how I can pull the data from Forbes using RSelenium? Ideally I want to pull the data from page 1, 2, etc. of the list.

Thanks!


Solution

  • It's a little hacky, but here's my solution using rvest and read.delim...

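    library(rvest)
    
    # The "#..." fragment is never sent to the server, so this request
    # returns the same document whatever page number the fragment names.
    url <- "http://www.forbes.com/global2000/list/#page:1_sort:0_direction:asc_search:_filter:All%20industries_filter:All%20countries_filter:All%20states"
    
    # Grab the text of the element with id "thelist".
    # read_html() replaces the deprecated html().
    a <- read_html(url) %>%
      html_nodes("#thelist") %>%
      html_text()
    
    # Parse the tab-separated text, skipping the first 12 lines.
    con <- textConnection(a)
    df <- read.delim(con, sep = "\t", header = FALSE, skip = 12, stringsAsFactors = FALSE)
    close(con)
    
    # Where V1 is empty, take the value from V3; then drop the spare
    # columns and the rows that are still empty.
    df$V1[df$V1 == ""] <- df$V3[df$V1 == ""]
    df$V2 <- df$V3 <- NULL
    df <- subset(df, V1 != "")
    
    # Each company spans six consecutive rows; pick each field by position.
    df$index <- seq_len(nrow(df))
    df2 <- data.frame(company = df$V1[df$index %% 6 == 1],
                      country = df$V1[df$index %% 6 == 2],
                      sales = df$V1[df$index %% 6 == 3],
                      profits = df$V1[df$index %% 6 == 4],
                      assets = df$V1[df$index %% 6 == 5],
                      market_value = df$V1[df$index %% 6 == 0],
                      stringsAsFactors = FALSE)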
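
Since the question asked about RSelenium and about pages beyond the first, here is a rough sketch of that route, not a tested recipe for this site. It assumes a Selenium server is already listening on localhost:4444 with Firefox available, that the site's JavaScript redraws the table when the fragment changes, and that the list still lives in `#thelist`. `page_url()` and `scrape_pages()` are hypothetical helper names, and the fixed `Sys.sleep()` is only a crude stand-in for a proper wait.

```r
# Build the list URL for a given page number (hypothetical helper).
# Only the "page:" value in the fragment changes between pages.
page_url <- function(page) {
  paste0(
    "http://www.forbes.com/global2000/list/",
    "#page:", page,
    "_sort:0_direction:asc_search:",
    "_filter:All%20industries_filter:All%20countries_filter:All%20states"
  )
}

# Visit each page in a real browser and return the text of #thelist,
# one list element per page (hypothetical helper).
# Assumption: a Selenium server is running at localhost:4444.
scrape_pages <- function(pages, wait = 5) {
  remDr <- RSelenium::remoteDriver(
    remoteServerAddr = "localhost",
    port = 4444L,
    browserName = "firefox"
  )
  remDr$open()
  on.exit(remDr$close(), add = TRUE)  # close the browser even on error

  lapply(pages, function(p) {
    remDr$navigate(page_url(p))
    Sys.sleep(wait)  # crude wait for the JavaScript to redraw the table
    page_source <- remDr$getPageSource()[[1]]
    doc <- rvest::read_html(page_source)
    rvest::html_text(rvest::html_nodes(doc, "#thelist"))
  })
}
```

With a server running, `texts <- scrape_pages(1:3)` would return one block of text per page, and each block could then go through the same `textConnection()` and `read.delim()` cleanup as above.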