Custom Function for obtaining URL directory


This looks like it should be easy. Consider the following URLs:

[1] "scripts.iucr.org/cgi-bin/paper?S1600536812045886"
[2] "cpa-seoadvisors.com/cvv/auth/auth/view/pdf/index.html/"
[3] "www.scirp.org/journal/PaperDownload.aspx?DOI=10.4236/csta.2012.13014"
[4] "www.google.com.cy/search?q=DNS+traffic&es_..."
[5] "seesaa.net/pede/lobortis/ligula/sit/amet.png?semper=vitae&est=..."

I want to extract the directory part of each URL: everything between the first '/' and the '/' that precedes the token containing the '?'. URLs that have no such part should give 0. I wrote the following function:

get_directory <- function(x){
  dir <- sapply(strsplit(x, '/'), function(i)sum(grepl('\\?', i)))
  ifelse(dir > 0, sapply(strsplit(x, '/'), function(i) paste(i[-c(1, length(i))], collapse = '/')), 0)
}

But it fails for URLs [3] and [4].

Expected output should be

"cgi-bin"
"0"
"journal"
"0"
"pede/lobortis/ligula/sit"

DATA

dput(df)
structure(list(V1 = c("scripts.iucr.org/cgi-bin/paper?S1600536812045886", 
"cpa-seoadvisors.com/cvv/auth/auth/view/pdf/index.html/", "www.scirp.org/journal/PaperDownload.aspx?DOI=10.4236/csta.2012.13014", 
"www.google.com.cy/search?q=DNS+traffic&es_...", "seesaa.net/pede/lobortis/ligula/sit/amet.png?semper=vitae&est=..."
)), .Names = "V1", row.names = c(NA, -5L), class = "data.frame")

Solution

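  • We can use str_extract from the stringr package. Using regex lookarounds, we match zero or more characters (.*) that are preceded by a / and followed by a /, then one or more characters that are not ? ([^?]+), then a ?. Because .* is greedy, the match runs up to the last / that still has a ? after it, so a / inside the query string (as in URL [3]) is not treated as the separator. URLs with no match return NA, which we then replace with 0.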

    library(stringr)
    res <- str_extract(df$V1, "(?<=\\/).*(?=\\/[^?]+[?])")
    res[is.na(res)] <- 0
    res
    #[1] "cgi-bin"                  "0"                        "journal"                 
    #[4] "0"                        "pede/lobortis/ligula/sit"
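
For comparison, the same result can probably be reached in base R by keeping the strsplit idea from the question but removing the query string first, so that any / after the ? cannot affect the split. This is only a sketch: the name get_directory2 and the urls vector are made up for illustration, and it assumes everything from the first ? onward is the query string.

```r
# Hypothetical base R helper (not part of the original answer).
# Assumption: everything from the first '?' onward is the query string.
get_directory2 <- function(x) {
  # Only URLs that contain a '?' are candidates
  has_query <- grepl("?", x, fixed = TRUE)
  # Drop the query string so a '/' inside it cannot create extra pieces
  path <- sub("\\?.*$", "", x)
  parts <- strsplit(path, "/", fixed = TRUE)
  dirs <- vapply(parts, function(p) {
    # Remove the host (first piece) and the file name (last piece);
    # with only two pieces there is no directory in between
    if (length(p) > 2) paste(p[-c(1, length(p))], collapse = "/") else NA_character_
  }, character(1))
  ifelse(has_query & !is.na(dirs), dirs, "0")
}

urls <- c("scripts.iucr.org/cgi-bin/paper?S1600536812045886",
          "cpa-seoadvisors.com/cvv/auth/auth/view/pdf/index.html/",
          "www.scirp.org/journal/PaperDownload.aspx?DOI=10.4236/csta.2012.13014",
          "www.google.com.cy/search?q=DNS+traffic&es_...",
          "seesaa.net/pede/lobortis/ligula/sit/amet.png?semper=vitae&est=...")

result <- get_directory2(urls)
# expected values: "cgi-bin", "0", "journal", "0", "pede/lobortis/ligula/sit"
```

The original function fails on [3] because the / inside the query string adds an extra piece to the split, and on [4] because dropping the first and last of only two pieces leaves an empty string rather than 0. Stripping the query and checking the number of pieces addresses both cases.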