Empty variables while attempting to scrape a table from a webpage using rvest


I am brand new to coding in R and am trying to scrape the following table into a data frame:

https://www.zyxware.com/articles/5363/list-of-fortune-500-companies-and-their-websites-2015

It should be fairly straightforward, but my variables have 0 observations and I'm not sure what I'm doing wrong. The code I used is:

library(tidyverse)
library(rvest)

#set the url of the website
url <- read_html("https://www.zyxware.com/articles/5363/list-of-fortune-500-companies-and-their-websites-2015")

#Scrape variables
rank <- url %>% html_nodes(".td:nth-child(1)") %>% html_text()
company <- url %>% html_nodes(".td:nth-child(2)") %>% html_text()
website <- url %>% html_nodes(".td~ td+ td") %>% html_text()

#Create dataframe
fortune500 <- data.frame(company,rank,website)

I was attempting to follow this walkthrough. Any help is much appreciated :)


Solution

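  • You can do it by calling html_table() on url and picking the first element. html_table() returns a list with one table per <table> element on the page, and the first one is the table you want.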

    library(tidyverse)
    library(rvest)
    url <- read_html("https://www.zyxware.com/articles/5363/list-of-fortune-500-companies-and-their-websites-2015")
    url %>% html_table() %>% pluck(1)
    #> # A tibble: 500 × 3
    #>     Rank Company            Website                  
    #>    <int> <chr>              <chr>                    
    #>  1     1 Walmart            www.walmart.com          
    #>  2     2 Exxon Mobil        www.exxonmobil.com       
    #>  3     3 Chevron            www.chevron.com          
    #>  4     4 Berkshire Hathaway www.berkshirehathaway.com
    #>  5     5 Apple              www.apple.com            
    #>  6     6 General Motors     www.gm.com               
    #>  7     7 Phillips 66        www.phillips66.com       
    #>  8     8 General Electric   www.ge.com               
    #>  9     9 Ford Motor         www.ford.com             
    #> 10    10 CVS Health         www.cvshealth.com        
    #> # … with 490 more rows
    

    Created on 2023-03-01 with reprex v2.0.2
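
To see why picking the first element is needed, here is a minimal offline sketch. The two-table page is hypothetical and built in memory with minimal_html(), so it only illustrates the shape of what html_table() returns and is not the real Fortune 500 page.

```r
library(rvest)

# Hypothetical two-table page built in memory, so no network access is needed.
page <- minimal_html('
  <table>
    <tr><th>Rank</th><th>Company</th></tr>
    <tr><td>1</td><td>Walmart</td></tr>
    <tr><td>2</td><td>Exxon Mobil</td></tr>
  </table>
  <table>
    <tr><th>Note</th></tr>
    <tr><td>Illustrative second table</td></tr>
  </table>
')

# html_table() on a whole document returns a list with one element per <table>.
tables <- html_table(page)
length(tables)

# Pick the table you want by position; [[1]] does the same job as pluck(1).
first_table <- tables[[1]]
first_table
```

If the table you want is not the first one on a page, change the index accordingly, or select the <table> node first and call html_table() on that node.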

    Alternatively, your original code will also work once you remove the period in front of td. In a CSS selector, a leading . selects by class, so .td looks for elements with class="td" rather than <td> tags, which is why your selectors matched nothing. Without the period, td matches tags called td, which is what you want.

    library(tidyverse)
    library(rvest)
    url <- read_html("https://www.zyxware.com/articles/5363/list-of-fortune-500-companies-and-their-websites-2015")
    rank <- url %>% html_nodes("td:nth-child(1)") %>% html_text()
    company <- url %>% html_nodes("td:nth-child(2)") %>% html_text()
    website <- url %>% html_nodes("td~ td+ td") %>% html_text()
    fortune500 <- data.frame(company,rank,website)
    head(fortune500)
    #>              company rank                   website
    #> 1            Walmart    1           www.walmart.com
    #> 2        Exxon Mobil    2        www.exxonmobil.com
    #> 3            Chevron    3           www.chevron.com
    #> 4 Berkshire Hathaway    4 www.berkshirehathaway.com
    #> 5              Apple    5             www.apple.com
    #> 6     General Motors    6                www.gm.com
    

    Created on 2023-03-01 with reprex v2.0.2
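
The difference between a class selector and a tag selector can be checked on a small in-memory page. This is only a sketch: the HTML below, including the span with class="td", is made up for illustration and does not come from the page in the question.

```r
library(rvest)

# Hypothetical page: one element that really has class "td", plus a plain table.
page <- minimal_html('
  <span class="td">not a table cell</span>
  <table>
    <tr><td>1</td><td>Walmart</td><td>www.walmart.com</td></tr>
    <tr><td>2</td><td>Exxon Mobil</td><td>www.exxonmobil.com</td></tr>
  </table>
')

# ".td" is a class selector: it matches the <span>, not the <td> cells.
class_match <- page %>% html_nodes(".td") %>% html_text()

# "td:nth-child(1)" is a tag selector: it matches the first <td> in each row.
first_cells <- page %>% html_nodes("td:nth-child(1)") %>% html_text()

# ".td:nth-child(1)" inside the table matches nothing, as in the question.
no_match <- page %>% html_nodes("tr .td:nth-child(1)") %>% html_text()

class_match
first_cells
length(no_match)
```

Checking the length of a selector's result on a tiny example like this is a quick way to spot a selector that matches nothing before building a data frame from it.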