html r web-scraping data-manipulation tibble

Placing "NA" into an Empty Position?


I am trying to scrape name and address information from Yellow Pages (https://www.yellowpages.ca/). I have a function (taken from the question "(R) Webscraping Error : arguments imply differing number of rows: 1, 0") that retrieves this information:

library(rvest)
library(dplyr)

scraper <- function(url) {
  page <- url %>% 
    read_html()
  
  tibble(
    name = page %>%  
      html_elements(".jsListingName") %>% 
      html_text2(),
    address = page %>% 
      html_elements(".listing__address--full") %>% 
      html_text2()
  )
}

However, the address is not always present. For example, several barbers are listed on this page, https://www.yellowpages.ca/search/si/1/barber/Sudbury+ON, and not all of them have an address. As a result, the name and address columns end up with different lengths, and running the function gives the following error:

scraper("https://www.yellowpages.ca/search/si/1/barber/Sudbury+ON")

Error:
! Tibble columns must have compatible sizes.
* Size 14: Existing data.
* Size 12: Column `address`.
i Only values of size one are recycled.
Run `rlang::last_error()` to see where the error occurred.

My question: is there some way to modify the "scraper" function so that an NA appears in the address column whenever a listing has no address? For example:

     barber    address
1 barber111 address111
2 barber222 address222
3 barber333         NA

Is there some way I could add a statement similar to CASE WHEN that would grab the address or place an NA when the address is not there?


Solution

  • To match each business with its address, find the root node of each listing and extract the text from the relevant child node within it. If the child node is missing, return NA instead.

    library(rvest)
    library(dplyr)
    
    scraper <- function(url) {
      # One node per listing, so names and addresses stay aligned row by row
      nodes <- read_html(url) %>% html_elements(".listing_right_section")

      tibble(
        name = nodes %>% sapply(function(x) {
          x <- html_text2(html_elements(x, css = ".jsListingName"))
          if (length(x)) x else NA
        }),
        address = nodes %>% sapply(function(x) {
          # NA when this listing has no address node
          x <- html_text2(html_elements(x, css = ".listing__address--full"))
          if (length(x)) x else NA
        })
      )
    }
    

    So now we can do:

    scraper("https://www.yellowpages.ca/search/si/1/barber/Sudbury+ON")
    #> # A tibble: 14 x 2
    #>    name                                      address                            
    #>    <chr>                                     <chr>                              
    #>  1 Lords'n Ladies Hair Design                1560 Lasalle Blvd, Sudbury, ON P3A~
    #>  2 Jo's The Lively Barber                    611 Main St, Lively, ON P3Y 1M9    
    #>  3 Hairapy Studio 517 & Barber Shop          517 Notre Dame Ave, Sudbury, ON P3~
    #>  4 Nickel Range Unisex Hairstyling           111 Larch St, Sudbury, ON P3E 4T5  
    #>  5 Ugo Barber & Hairstyling                  911 Lorne St, Sudbury, ON P3C 4R7  
    #>  6 Gordon's Hairstyling                      19 Durham St, Sudbury, ON P3C 5E2  
    #>  7 Valley Plaza Barber Shop                  5085 Highway 69 N, Hanmer, ON P3P ~
    #>  8 Rick's Hairstyling Shop                   28 Young St, Capreol, ON P0M 1H0   
    #>  9 President Men's Hairstyling & Barber Shop 117 Elm St, Sudbury, ON P3C 1T3    
    #> 10 Pat's Hairstylists                        33 Godfrey Dr, Copper Cliff, ON P0~
    #> 11 WildRootz Hair Studio                     911 Lorne St, Sudbury, ON P3C 4R7  
    #> 12 Sleek Barber Bar                          324 Elm St, ON P3C 1V8             
    #> 13 Faiella Classic Hair                      <NA>                               
    #> 14 Ben's Barbershop & Hairstyling            <NA>
    

    Created on 2022-09-16 with reprex v2.0.2
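
A possible variation on the same idea, sketched under some assumptions: rvest's singular html_element() returns one result per input node, and html_text2() yields NA for a node that was not found, so the explicit length check may not be needed. The snippet below uses a small made-up HTML fragment built with minimal_html() in place of the live page, so it runs offline. The fragment's contents and the helper name parse_listings() are hypothetical; only the class names are taken from the answer above.

```r
library(rvest)
library(dplyr)

# Hypothetical stand-in for a results page: three listings, the last one
# without an address. Only the class names come from the answer above.
page <- minimal_html('
  <div class="listing_right_section">
    <a class="jsListingName">barber111</a>
    <span class="listing__address--full">address111</span>
  </div>
  <div class="listing_right_section">
    <a class="jsListingName">barber222</a>
    <span class="listing__address--full">address222</span>
  </div>
  <div class="listing_right_section">
    <a class="jsListingName">barber333</a>
  </div>
')

# parse_listings() is a hypothetical helper name, not part of rvest.
parse_listings <- function(page) {
  # One node per listing keeps both columns the same length
  nodes <- html_elements(page, ".listing_right_section")

  tibble(
    # html_element() (singular) gives exactly one result per listing node;
    # a missing child becomes NA after html_text2()
    name    = nodes %>% html_element(".jsListingName") %>% html_text2(),
    address = nodes %>% html_element(".listing__address--full") %>% html_text2()
  )
}

result <- parse_listings(page)

# The same helper could be pointed at a live URL (not run here):
scraper <- function(url) parse_listings(read_html(url))
```

With the fragment above, result has three rows and the address in the third row is NA. Unlike the sapply() version, this keeps only the first match inside each listing, which is usually what is wanted when each listing has at most one name and one address.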