How to properly pass quoted strings as part of URL input to httr::GET()?


Suppose I wish to pass a URL like so to httr::GET():

https://www.uniprot.org/uniprot/?query=name%3A"dna+methyltransferase"

How would I go about getting the quoted portion of this string (i.e., "dna+methyltransferase") passed as input correctly? My input URL string is stored as follows, and passing it directly does not work as the escaped double quotes are not being evaluated:

> urlinp <- "https://www.uniprot.org/uniprot/?query=name%3A\"dna+methyltransferase\""
> status_code(GET(urlinp))
# [1] 400

The one idea I had was to use capture.output() with cat() to try and pass the (parsed) string, but that didn't work either:

> status_code(GET(capture.output(cat(urlinp))))
[1] 400

I frankly don't know how to do this. Googling did not really help (or I was searching with inappropriate terms). Any pointers would be much appreciated.

Edit: updated context below.

So, I basically have a small function that takes two strings SoughtProtein and SoughtTaxon as inputs, and formulates a URL query (?) out of it as shown below.

UniProtQueryConstructor <- function(SoughtProtein = NULL, SoughtTaxon = NULL){

  #Function constants
  tmpUniProtBaseURL <- "https://www.uniprot.org/uniprot/"
  tmpUniProtURLRetFormat <- "&format=tab"

  #Formatting steps below
  if(!is.null(SoughtProtein)){


    #If the protein name has more than one word (e.g., "DNA methyltransferase"), enclose it in double quotes

    if(stringr::str_detect(SoughtProtein, "\\s")){

      #Lowercasing the string, and replacing punctuation and whitespace with "+"
      innertmpProtName <- stringr::str_replace_all(paste0(tolower(SoughtProtein)), stringr::regex("[[:punct:]\\s]+"), "+")

      #Enclosing the multi-word string in double quotes
      innertmpProtName <- paste0('\"', innertmpProtName, '\"')

      #Writing it to a temporary variable that will be passed on for final URL assembly
      tmpProtName <- paste0("name%3A", innertmpProtName)

    } else{

      #Else condition is a simple case, since there is no multi-word string to be dealt with

      tmpProtName <- paste0("name%3A", stringr::str_replace_all(paste0(tolower(SoughtProtein)), stringr::regex("[[:punct:]\\s]+"), "+"))

    }

  } else{ 

    #Else assign an empty string to the protein name if no user input was given

    tmpProtName <- ""

  }

  #Input string prep for taxon selection
  if(!is.null(SoughtTaxon)){

    tmpTaxon <- paste0("taxonomy%3A", stringr::str_replace_all(paste0(tolower(SoughtTaxon)), stringr::regex("[[:punct:]\\s]+"), "+"))

  } else{

    tmpTaxon <- ""

  }


  #Combining user inputs into one single string
  tmpInpTermList <- c(tmpProtName, tmpTaxon)


  #Preparing query string
  tmpAssembledUniProtQuery <- paste0("?query=", paste(tmpInpTermList[which(nchar(tmpInpTermList) > 0)], sep = "", collapse = "+AND+"))


  #Full query URL
  tmpFullUniProtSearchURL <- paste0(tmpUniProtBaseURL, tmpAssembledUniProtQuery, tmpUniProtURLRetFormat)

  return(tmpFullUniProtSearchURL)
}

#Test case below

TestSearch <- UniProtQueryConstructor(SoughtProtein = "DNA methyltransferase", SoughtTaxon = "Eukaryota")

#Double quotes within the string not dealt with properly.
TestSearch

# [1] "https://www.uniprot.org/uniprot/?query=name%3A\"dna+methyltransferase\"+AND+taxonomy%3Aeukaryota&format=tab"

The problem is that this function needs to handle inputs that contain more than one word separated by a space (e.g. "DNA methyltransferase") by enclosing them in double quotes within the query string, as follows:

query=name%3A"dna+methyltransferase"

This is where I run into my problem: I'm unable to get the escaped double quotes to show up properly (as can be seen in the sample output).

I wrote this update just as the multiple answers with URLencode() arrived. I think the proposed solutions solve the problem at hand (of parsing the string properly), and also slightly alleviate the problem at large (of me being terrible at writing code; I learned something new today!).


Solution

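  • I tried to find posts that covered this already, but there's a little detail here that threw me off. The backslashes you see when the string is printed are only R's way of displaying embedded quotation marks; the string itself contains plain " characters, which are not valid in a URL unless they are percent-encoded. You can use utils::URLencode to encode the URL so that the quotation marks are replaced with their percent-encoded equivalents.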

    URLencode has an argument repeated, which defaults to FALSE:

    repeated—logical: should apparently already-encoded URLs be encoded again?

    An ‘apparently already-encoded URL’ is one containing %xx for two hexadecimal digits.

    Your URL already has one piece encoded with %3A, the encoded version of :; because an encoded substring already exists, no further encoding is done by default. Setting repeated = TRUE forces the encoding, so the quotation marks become %22 (note that the existing %3A is also re-encoded, to %253A):

    library(httr)
    
    urlinp <- 'https://www.uniprot.org/uniprot/?query=name%3A"dna+methyltransferase"'
    
    URLencode(urlinp, repeated = FALSE)
    #> [1] "https://www.uniprot.org/uniprot/?query=name%3A\"dna+methyltransferase\""
    URLencode(urlinp, repeated = TRUE)
    #> [1] "https://www.uniprot.org/uniprot/?query=name%253A%22dna+methyltransferase%22"
    
    status_code(GET(URLencode(urlinp, repeated = TRUE)))
    #> [1] 200
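
Because repeated = TRUE also turns the existing %3A into %253A, an alternative is to percent-encode only the search term, once, before it is pasted into the URL. The following is a minimal sketch under that assumption: encode_query_value() is a hypothetical helper name (not from the original post), and it uses URLencode(reserved = TRUE) so that :, " and spaces are all escaped. Spaces come out as %20 rather than +; whether UniProt treats the two identically has not been checked here.

```r
# Hypothetical helper (an assumption for illustration): percent-encode a
# single query value exactly once, including reserved characters such as ':'.
encode_query_value <- function(value) {
  utils::URLencode(value, reserved = TRUE)
}

# Start from the raw, unencoded search term (plain quotes, plain space)
raw_term <- 'name:"dna methyltransferase"'
encoded_term <- encode_query_value(raw_term)
encoded_term
#> [1] "name%3A%22dna%20methyltransferase%22"

# Assemble the full URL from the base, the encoded term, and the format flag
urlinp <- paste0("https://www.uniprot.org/uniprot/?query=", encoded_term, "&format=tab")

# The assembled URL now contains %xx sequences, so URLencode() with the
# default repeated = FALSE returns it unchanged
identical(utils::URLencode(urlinp), urlinp)
#> [1] TRUE
```

The resulting urlinp can then be passed to GET() as before; encoding the pieces before assembly avoids having to re-encode the whole URL afterwards.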