
Using purrr's possibly over a list to convert empty tibble columns to NA values


I have a function that uses rvest to extract data from a web page. The function is shown below for context; its details are not important:

processCardPackMinimalRealtorInfo = function(rowPosition){
  # collect realtor information
  realEstateInformation = bind_cols(
    realEstateCompanyName = CardPackMinimal[rowPosition] %>% 
      html_elements('.re-CardPromotionLogo') %>% 
      html_nodes("a") %>% 
      html_children() %>% 
      html_attr("title"),
    
    realEstatePageLink = CardPackMinimal[rowPosition] %>% 
      html_elements('.re-CardPromotionLogo') %>% 
      html_nodes("a") %>% 
      html_attr('href') %>% 
      paste("https://www.fotocasa.es", ., sep = "")
  )
  return(realEstateInformation)
}

The function runs without error, but when a listing has no realtor information it returns a tibble with 0 rows. I tried wrapping it in purrr's possibly() so that it returns NA values in that case, but I cannot get the wrapped function to return a data frame of NAs when there is no information.

possiblyProcessCardPackMinimalRealtorInfo = possibly(processCardPackMinimalRealtorInfo,
                                                     otherwise = tibble(
                                                       realEstateCompanyName = NA_character_,
                                                       realEstatePageLink = NA_character_
                                                     ))

My question is: how can I correct the possibly() call so that it returns NA values when the collected data does not exist, i.e. when the tibble is 0 x 2? (The 2 columns are realEstateCompanyName and realEstatePageLink, generated in the original function.)

Apologies in advance for not including dput output or sample data; the data comes from web scraping and takes a few hours to collect.


Solution

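  • possibly() only substitutes the otherwise value when the wrapped function throws an error. A 0-row tibble is a valid return value, so it passes through unchanged. The function processCardPackMinimalRealtorInfo should therefore throw an error when it produces no rows, so that possibly() can handle it: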

    library(tibble)
    library(purrr)
    
    data0 <- tibble(realEstateCompanyName = character(0),
                    realEstatePageLink = character(0))
    
    data1 <- tibble(realEstateCompanyName = "a",
                    realEstatePageLink = "b")
    
    
    processCardPackMinimalRealtorInfo <- function(data) {
      if (nrow(data) == 0) stop('no rows')
      data
    }
    
    processCardPackMinimalRealtorInfo(data0)
    #> Error in processCardPackMinimalRealtorInfo(data0): no rows
    
    list(data1,data0) %>% map(possibly(processCardPackMinimalRealtorInfo,
                                       otherwise = tibble(
                                         realEstateCompanyName = NA_character_,
                                         realEstatePageLink = NA_character_
                                       )))
    #> [[1]]
    #> # A tibble: 1 × 2
    #>   realEstateCompanyName realEstatePageLink
    #>   <chr>                 <chr>             
    #> 1 a                     b                 
    #> 
    #> [[2]]
    #> # A tibble: 1 × 2
    #>   realEstateCompanyName realEstatePageLink
    #>   <chr>                 <chr>             
    #> 1 <NA>                  <NA>
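
If the scraping function cannot easily be edited, the same idea can be applied from the outside. The sketch below first shows that possibly() leaves a 0-row result untouched, then wraps the function in an adapter that turns an empty result into an error. The names returns_empty and error_if_empty are hypothetical and used only for illustration; returns_empty stands in for the real scraper.

```r
library(tibble)
library(purrr)

na_row <- tibble(realEstateCompanyName = NA_character_,
                 realEstatePageLink = NA_character_)

# Hypothetical stand-in for the scraper: always returns a 0 x 2 tibble, never errors.
returns_empty <- function(x) {
  tibble(realEstateCompanyName = character(0),
         realEstatePageLink = character(0))
}

# possibly() only intercepts errors, so the empty tibble passes straight through.
unchanged <- possibly(returns_empty, otherwise = na_row)(1)
nrow(unchanged)
#> [1] 0

# Hypothetical adapter: signal an error on an empty result so possibly() can react.
error_if_empty <- function(f) {
  function(...) {
    out <- f(...)
    if (nrow(out) == 0) stop("no rows")
    out
  }
}

handled <- possibly(error_if_empty(returns_empty), otherwise = na_row)(1)
nrow(handled)
#> [1] 1
```

Note that possibly() will also swallow genuine failures (for example a network error) and return the same NA row, so the two cases become indistinguishable in the output.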
    

    Another possibility is to handle the 0-row case inside the function itself:

    processCardPackMinimalRealtorInfo <- function(data) { 
      if (nrow(data)==0) data = tibble(realEstateCompanyName = NA_character_,
                                       realEstatePageLink = NA_character_)
      data
    }
    
    list(data1,data0) %>% map(processCardPackMinimalRealtorInfo)
    
    #> [[1]]
    #> # A tibble: 1 × 2
    #>   realEstateCompanyName realEstatePageLink
    #>   <chr>                 <chr>             
    #> 1 a                     b                 
    #> 
    #> [[2]]
    #> # A tibble: 1 × 2
    #>   realEstateCompanyName realEstatePageLink
    #>   <chr>                 <chr>             
    #> 1 <NA>                  <NA>
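
Applied to an rvest-based function like the one in the question, the in-function check might look like the following minimal sketch. It assumes each card is an HTML node that may or may not contain a .re-CardPromotionLogo element; the cards list built with minimal_html() and the function name extract_realtor_info are hypothetical stand-ins for the real scraped data and function.

```r
library(rvest)
library(tibble)
library(purrr)

# Hypothetical cards: the first has a promotion logo link, the second has none.
cards <- list(
  minimal_html('<div class="re-CardPromotionLogo"><a href="/es/agency"><img title="Agency A"></a></div>'),
  minimal_html('<div class="other"></div>')
)

extract_realtor_info <- function(card) {
  links <- html_elements(card, ".re-CardPromotionLogo a")

  # No matching nodes: return one row of NAs instead of a 0-row tibble.
  if (length(links) == 0) {
    return(tibble(realEstateCompanyName = NA_character_,
                  realEstatePageLink = NA_character_))
  }

  tibble(
    realEstateCompanyName = html_attr(html_children(links), "title"),
    realEstatePageLink = paste0("https://www.fotocasa.es", html_attr(links, "href"))
  )
}

results <- map(cards, extract_realtor_info)
```

Checking the node set before building the tibble keeps the NA handling explicit and leaves real errors visible, which possibly() would otherwise hide.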