I am fairly new to R and am having trouble with pulling data from the Forbes website.
My current code is:
library(XML)
url = "http://www.forbes.com/global2000/list/#page:1_sort:0_direction:asc_search:_filter:All%20industries_filter:All%20countries_filter:All%20states"
data = readHTMLTable(url)
However, the Forbes link is anchored with the "#" symbol. I installed the RSelenium package in order to parse the data I want, but I am not well versed in RSelenium.
Does anyone have any advice on or experience with RSelenium, and on how I can use it to pull the data from Forbes? Ideally I want to pull data from page 1, 2, etc. of the website.
Thanks!
It's a little hacky, but here's my solution using rvest and read.delim...
library(rvest)
url <- "http://www.forbes.com/global2000/list/#page:1_sort:0_direction:asc_search:_filter:All%20industries_filter:All%20countries_filter:All%20states"
a <- read_html(url) %>%  # read_html() replaces html(), which is deprecated in newer rvest versions
  html_nodes("#thelist") %>%
  html_text()
con <- textConnection(a)
df <- read.delim(con, sep="\t", header=FALSE, skip=12, stringsAsFactors=FALSE)
close(con)
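# Where V1 is empty, take the value from V3 instead
df$V1[df$V1==""] <- df$V3[df$V1==""]
# V2 and V3 are no longer needed
df$V2 <- df$V3 <- NULL
# Drop the rows that are still blank
df <- subset(df, V1!="")
# Running index, used below to pick out every sixth value
df$index <- 1:nrow(df)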
df2 <- data.frame(company=df$V1[df$index%%6==1],
                  country=df$V1[df$index%%6==2],
                  sales=df$V1[df$index%%6==3],
                  profits=df$V1[df$index%%6==4],
                  assets=df$V1[df$index%%6==5],
                  market_value=df$V1[df$index%%6==0])
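If the modulo indexing feels opaque, the same reshaping can probably be done by filling a six-column matrix row by row. This is only a sketch on made-up data: the vector `v` and the name `df2_alt` are placeholders I introduced, and it assumes the cleaned column really does contain exactly six values per company, in the same order as above.

```r
# Toy stand-in for the cleaned single column (df$V1 above).
# These values are invented placeholders, not Forbes data.
v <- c("Company A", "Country X", "10", "1", "100", "50",
       "Company B", "Country Y", "20", "2", "200", "60")

# Fill a six-column matrix row by row, so each row holds one company.
m <- matrix(v, ncol = 6, byrow = TRUE)

# Convert to a data frame and name the columns as in df2.
df2_alt <- data.frame(m, stringsAsFactors = FALSE)
names(df2_alt) <- c("company", "country", "sales",
                    "profits", "assets", "market_value")

# The modulo approach picks out the same values, e.g. for the first column:
idx <- seq_along(v)
identical(v[idx %% 6 == 1], df2_alt$company)
```

With the real data you would pass `df$V1` in place of `v`. If its length is not a multiple of six, `matrix()` warns and recycles values, which is a useful sign that the scraped text did not split the way the code assumes.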