google-sheets google-sheets-api googlesheets4

How to read hyperlinks from range_read_cells() in googlesheets4

I need to read the urls from hyperlinks in Google sheets (where applicable) to be used in other parts of my analysis. I was reading my data with read_sheet() or range_read(), but that only gives me the value of the cell (what's visible - no formulae or urls). I turned to range_read_cells() and had some success, but now I'm stuck. Consider the following:

library(googledrive)
library(googlesheets4)
library(dplyr)

df_test <- read_sheet("https://docs.google.com/spreadsheets/d/1oLj5YyBG2-ucvhhMUrl2K0Ko8jJkX4uvDChVfSfxT0o/edit#gid=0")

View(df_test) # no urls - text only

df_test2 <- range_read_cells(ss = "https://docs.google.com/spreadsheets/d/1oLj5YyBG2-ucvhhMUrl2K0Ko8jJkX4uvDChVfSfxT0o/edit#gid=0",
                             sheet = "Sheet1", 
                             range = "A1:C5", 
                             cell_data = "full")

View(df_test2) # urls are in a deep within a list called 'cell'

x <-unlist(df_test2$cell)
x["hyperlink"]

The x["hyperlink"] call lets me access the url, but only the first one:

> x["hyperlink"]
                hyperlink 
"https://www.google.com/"

How can I apply this to the whole sheet? I need to read each url from the sheet and write them a new column. Note that not all rows will have a url, but I assume I can handle that case conditionally.

I've shared a Google sheet with sample data at https://docs.google.com/spreadsheets/d/1oLj5YyBG2-ucvhhMUrl2K0Ko8jJkX4uvDChVfSfxT0o/edit#gid=0. It's public, so you should be able to see it.

Thanks in advance!

Solution

I was trying to solve this exact problem today. I wrote the following function, which can be used in conjunction with purrr::map_df to do what you're looking to do:

library(googlesheets4)
library(dplyr)
library(purrr)
library(stringr)

extract_hyperlink_data <- function(cell) {
  cell_unlisted <- unlist(cell)
  list(
    text = cell_unlisted[["formattedValue"]],
    hyperlink = if ("hyperlink" %in% names(cell_unlisted)) cell_unlisted[["hyperlink"]] else NA
  )
}

df_test <- read_sheet("https://docs.google.com/spreadsheets/d/1oLj5YyBG2-ucvhhMUrl2K0Ko8jJkX4uvDChVfSfxT0o/edit#gid=0")

df_test2 <- range_read_cells(ss = "https://docs.google.com/spreadsheets/d/1oLj5YyBG2-ucvhhMUrl2K0Ko8jJkX4uvDChVfSfxT0o/edit#gid=0",
                             sheet = "Sheet1", 
                             range = "A1:C5", 
                             cell_data = "full")
URL_data <- df_test2 %>%
  filter(stringr::str_starts(loc, "C") & loc != "C1") %>%
  pull(cell) %>%
  purrr::map_df(., extract_hyperlink_data) %>%
  rename(Website = text, Website_URL = hyperlink)

df_test %>%
  left_join(URL_data) %>%
  View

Hope this helps - let me know if you need to troubleshoot. You could probably build a custom parser to use in conjunction with this, so that you don't need to know which columns have hyperlinks before you start, but I haven't done that yet.