Tags: r, url, data-manipulation, data-cleaning

Replace a URL (or a string containing multiple URLs) with a value returned from a function


We have a df like so:

df <- data.frame(id= c(1,2,3,4,5),
                 urls= c(NA,NA,"https://www.bing.com",
                         "https://www.bing.com https://www.google.com",
                         "https://github.com/"),
                 stringsAsFactors = FALSE)

Then we have a function that reads in real URLs and gets the title tag of each page, like so:

get_title_tag <- function(url) {

  # Treat NA or empty strings as missing and return NA straight away
  if (is.na(ifelse(url == "", NA, url))) {
    return(NA)
  }

  # Fetch the page once and reuse it below
  page <- xml2::read_html(url)

  if (identical(page, character(0))) {
    return(NA)
  }

  # Pull the <title> text out of the <head>
  path_to_title <- "/html/head/title"

  conf_nodes <- rvest::html_nodes(page, xpath = path_to_title)

  title <- rvest::html_text(conf_nodes)

  # The real title is stubbed out here so the example output is reproducible
  # return(title)
  return("PAGE_TITLE")
}
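
For a single address the function behaves as expected (a quick sanity check; fetching the real page needs a network connection, and the stubbed return gives "PAGE_TITLE" either way):

get_title_tag(NA)                     # NA
get_title_tag("https://github.com/")  # "PAGE_TITLE"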

The problem is that the element at the 4th position of the urls column contains two consecutive URLs, so we get errors. We have looked at several posts here in the forums, but none of them address a problem like the one we are facing.
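
The failure is easy to reproduce: the combined string is not a single valid address, while each piece works on its own once the string is split on the space (a sketch; the exact error text depends on the xml2/curl versions):

df$urls[4]
# [1] "https://www.bing.com https://www.google.com"

# get_title_tag(df$urls[4])   # errors: read_html() cannot fetch this as one URL

strsplit(df$urls[4], " ")[[1]]
# [1] "https://www.bing.com"   "https://www.google.com"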

Our goal is to get this output:

> df
  id                  urls
1  1                  <NA>
2  2                  <NA>
3  3            PAGE_TITLE
4  4 PAGE_TITLE PAGE_TITLE
5  5            PAGE_TITLE

I have tried this method, which separates the URLs, but it expands the df, which is not what I want:

library(dplyr)
library(tidyr)

urls_only_vector <- df %>%
                      mutate(urls = strsplit(as.character(urls), " ")) %>%
                      unnest(urls) #%>% select("urls")
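
For context, the expanded frame ends up with six rows because id 4 is split across two (a rough sketch of the print; the exact formatting depends on the tidyr version):

urls_only_vector
#   id                   urls
# 1  1                   <NA>
# 2  2                   <NA>
# 3  3   https://www.bing.com
# 4  4   https://www.bing.com
# 5  4 https://www.google.com
# 6  5    https://github.com/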

Using this method I can read the URLs one at a time, but again, since it expands my data frame, I was wondering if there is something else I can do. Can I get a hint, please? I will cherish any help.


Solution

  • It is better to get the URLs into different rows, apply the get_title_tag function to get each title, and then combine the data again by grouping on id so that the size of the data remains the same.

    library(dplyr)
    
    df %>%
      tidyr::separate_rows(urls, sep = '\\s+') %>%              # one URL (or NA) per row
      mutate(title = purrr::map_chr(urls, get_title_tag)) %>%   # fetch each page title
      group_by(id) %>%
      summarise(title = paste(title, collapse = " "))           # back to one row per id
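
  • A base R alternative (a sketch, assuming get_title_tag() as defined above) avoids the reshape-and-regroup round trip and keeps the data frame at its original five rows: split each cell on whitespace, fetch a title per piece, and paste the results back together.

    df$urls <- vapply(df$urls, function(cell) {
      if (is.na(cell)) return(NA_character_)               # leave missing cells as NA
      pieces <- strsplit(cell, "\\s+")[[1]]                 # one URL per element
      titles <- vapply(pieces, get_title_tag, character(1)) # one title per URL
      paste(titles, collapse = " ")                         # e.g. "PAGE_TITLE PAGE_TITLE"
    }, character(1), USE.NAMES = FALSE)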