Suppose I wish to pass a URL like so to httr::GET():
https://www.uniprot.org/uniprot/?query=name%3A"dna+methyltransferase"
How would I go about getting the quoted portion of this string (i.e., "dna+methyltransferase") passed as input correctly? My input URL string is stored as follows, and passing it directly does not work as the escaped double quotes are not being evaluated:
> urlinp <- "https://www.uniprot.org/uniprot/?query=name%3A\"dna+methyltransferase\""
> status_code(GET(urlinp))
# [1] 400
The one idea I had was to use capture.output() with cat() to try and pass the (parsed) string, but that didn't work either:
> status_code(GET(capture.output(cat(urlinp))))
[1] 400
I frankly don't know how to do this. Googling did not really help (or I was searching with inappropriate terms). Any pointers would be much appreciated.
Edit: updated context below.
So, I basically have a small function that takes two strings, SoughtProtein and SoughtTaxon, as inputs and builds a URL query out of them, as shown below.
UniProtQueryConstructor <- function(SoughtProtein = NULL, SoughtTaxon = NULL){
  #Function constants
  tmpUniProtBaseURL <- "https://www.uniprot.org/uniprot/"
  tmpUniProtURLRetFormat <- "&format=tab"
  #Formatting steps below
  if(!is.null(SoughtProtein)){
    #If the protein name has more than one word (e.g., "DNA methyltransferase"), enclose that string in double quotes
    if(stringr::str_detect(SoughtProtein, "\\s")){
      #Lowercasing the string, and replacing punctuation and whitespace with "+"
      innertmpProtName <- stringr::str_replace_all(paste0(tolower(SoughtProtein)), stringr::regex("[[:punct:]\\s]+"), "+")
      #Enclosing the multi-word string in double quotes
      innertmpProtName <- paste0('\"', innertmpProtName, '\"')
      #Writing it to a temporary variable that will be passed on for final URL assembly
      tmpProtName <- paste0("name%3A", innertmpProtName)
    } else{
      #Else condition is a simple case, since there is no multi-word string to be dealt with
      tmpProtName <- paste0("name%3A", stringr::str_replace_all(paste0(tolower(SoughtProtein)), stringr::regex("[[:punct:]\\s]+"), "+"))
    }
  } else{
    #Else assign an empty string to the protein name if user input is non-existent
    tmpProtName <- ""
  }
  #Input string prep for taxon selection
  if(!is.null(SoughtTaxon)){
    tmpTaxon <- paste0("taxonomy%3A", stringr::str_replace_all(paste0(tolower(SoughtTaxon)), stringr::regex("[[:punct:]\\s]+"), "+"))
  } else{
    tmpTaxon <- ""
  }
  #Combining user inputs into one single vector
  tmpInpTermList <- c(tmpProtName, tmpTaxon)
  #Preparing query string from the non-empty terms
  tmpAssembledUniProtQuery <- paste0("?query=", paste(tmpInpTermList[which(nchar(tmpInpTermList) > 0)], sep = "", collapse = "+AND+"))
  #Full query URL
  tmpFullUniProtSearchURL <- paste0(tmpUniProtBaseURL, tmpAssembledUniProtQuery, tmpUniProtURLRetFormat)
  return(tmpFullUniProtSearchURL)
}
#Test case below
TestSearch <- UniProtQueryConstructor(SoughtProtein = "DNA methyltransferase", SoughtTaxon = "Eukaryota")
#Double quotes within the string not dealt with properly.
TestSearch
# [1] "https://www.uniprot.org/uniprot/?query=name%3A\"dna+methyltransferase\"+AND+taxonomy%3Aeukaryota&format=tab"
The problem is that this function needs to handle input strings that contain more than one word separated by a space (e.g. "DNA methyltransferase") by enclosing them in double quotes within the query string, as follows:
query=name%3A"dna+methyltransferase"
And this is where I'm running into my problem, in that I'm unable to have the escaped double quotes show up properly (as can be seen in the sample output).
I wrote this update just as the multiple answers suggesting URLencode() arrived. I think the proposed solutions solve the problem at hand (of parsing the string properly), and also slightly alleviate the problem at large (of me being terrible at writing code; I learned something new today!).
I tried to find posts that covered this already, but there's a little detail here that threw me off. You can use utils::URLencode to encode the URL so that the quotation marks are replaced with their percent-encoded equivalents.
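As a side note, the backslashes shown in the question's output are most likely just how print() displays a double quote inside a double-quoted string; they are not characters in the string itself. A minimal sketch to check this, reusing the urlinp value from the question:

```r
urlinp <- "https://www.uniprot.org/uniprot/?query=name%3A\"dna+methyltransferase\""

# print() shows the escaped form, so the quotes appear with backslashes
print(urlinp)
#> [1] "https://www.uniprot.org/uniprot/?query=name%3A\"dna+methyltransferase\""

# writeLines() shows the raw characters that would actually be sent
writeLines(urlinp)
#> https://www.uniprot.org/uniprot/?query=name%3A"dna+methyltransferase"

# The string contains no backslash, and an escaped quote is one character
has_backslash <- grepl("\\", urlinp, fixed = TRUE)
has_backslash
#> [1] FALSE
nchar("\"")
#> [1] 1
```

So the string already holds plain double quotes; what the request is missing is percent-encoding of those quotes.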
URLencode has an argument repeated, which defaults to FALSE:
repeated—logical: should apparently already-encoded URLs be encoded again?
An ‘apparently already-encoded URL’ is one containing %xx for two hexadecimal digits.
Your URL already has one piece encoded with %3A, the encoded version of :; because an encoded substring already exists, no further encoding is done by default. Instead, set repeated = TRUE, and the quotation marks get encoded as well (the existing %3A is also re-encoded, to %253A):
library(httr)
urlinp <- 'https://www.uniprot.org/uniprot/?query=name%3A"dna+methyltransferase"'
URLencode(urlinp, repeated = FALSE)
#> [1] "https://www.uniprot.org/uniprot/?query=name%3A\"dna+methyltransferase\""
URLencode(urlinp, repeated = TRUE)
#> [1] "https://www.uniprot.org/uniprot/?query=name%253A%22dna+methyltransferase%22"
status_code(GET(URLencode(urlinp, repeated = TRUE)))
#> [1] 200
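One way to avoid the double-encoded %253A might be to build the query value from raw, unencoded text and encode it once with reserved = TRUE. The sketch below is only an illustration: encode_query_value is a hypothetical helper (not part of httr or utils), and whether UniProt treats %20 the same as + in this query is an assumption that has not been checked here.

```r
# Hypothetical helper: percent-encode a single query value, including
# reserved characters such as ':' and '"'.
encode_query_value <- function(value) {
  utils::URLencode(value, reserved = TRUE)
}

base_url <- "https://www.uniprot.org/uniprot/"

# Raw query text, written the way a person would type it (nothing pre-encoded)
raw_query <- 'name:"dna methyltransferase"'

full_url <- paste0(base_url, "?query=", encode_query_value(raw_query))
full_url
#> [1] "https://www.uniprot.org/uniprot/?query=name%3A%22dna%20methyltransferase%22"
```

Because the raw text contains no %xx sequences, the repeated argument does not come into play, and each special character is encoded exactly once.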