Could someone help me? This is my problem: I have a list of URLs in a tbl and I have to extract the jpg name. This is the URL: https://content_xxx.xxx.com/vp/969ffffff61/5C55ABEB/t51.2ff5-15/e35/13643048_612108275661958_805860992_n.jpg?ff_cache_key=fffffQ%3ff%3D.2 and this is the part to extract: 13643048_612108275661958_805860992_n. Thanks for the help.
Googling for "R parse URL" could have saved you ~400 keystrokes (though I expect the URL was pasted).
In any event, you want to process a vector of these things, so there's a better way. In fact there are multiple ways to do this URL path extraction in R. Here are 3:
library(stringi)
library(urltools)
library(httr)
library(XML)
library(dplyr)
We'll generate 100 unique URLs that fit the same Instagram pattern. (NOTE: scraping Instagram is a violation of their ToS and is controlled by robots.txt. If your URLs did not come from the Instagram API, please let me know so I can delete this answer, as I don't help content thieves.)
set.seed(0)
paste(
"https://content_xxx.xxx.com/vp/969ffffff61/5C55ABEB/t51.2ff5-15/e35/13643048_612108275661958_805860992_n.jpg?ff_cache_key=fffffQ%3ff%3D.2",
stri_rand_strings(100, 8, "[0-9]"), "_",
stri_rand_strings(100, 15, "[0-9]"), "_",
stri_rand_strings(100, 9, "[0-9]"), "_",
stri_rand_strings(100, 1, "[a-z]"),
".jpg?ff_cache_key=MTMwOTE4NjEyMzc1OTAzOTc2NQ%3D%3D.2",
sep=""
) -> img_urls
head(img_urls)
## [1] "https://content_xxx.xxx.com/vp/969ffffff61/5C55ABEB/t51.2ff5-15/e35/13643048_612108275661958_805860992_n.jpg?ff_cache_key=fffffQ%3ff%3D.2"
## [2] "https://https://content_xxx.xxx.com/vp/969b7087cc97408ccee167d473388761/5C55ABEB/t51.2885-15/e35/66021637_359927357880233_471353444_q.jpg?ff_cache_key=MTMwOTE4NjEyMzc1OTAzOTc2NQ%3D%3D.2"
## [3] "https://https://content_xxx.xxx.com/vp/969b7087cc97408ccee167d473388761/5C55ABEB/t51.2885-15/e35/47937926_769874508959124_426288550_z.jpg?ff_cache_key=MTMwOTE4NjEyMzc1OTAzOTc2NQ%3D%3D.2"
## [4] "https://https://content_xxx.xxx.com/vp/vp/969b7087cc97408ccee167d473388761/5C55ABEB/t51.2885-15/e35/12303834_440673970920272_460810703_n.jpg?ff_cache_key=MTMwOTE4NjEyMzc1OTAzOTc2NQ%3D%3D.2"
## [5] "https://https://content_xxx.xxx.com/vp/969b7087cc97408ccee167d473388761/5C55ABEB/t51.2885-15/e35/54186717_202600346704982_713363439_y.jpg?ff_cache_key=MTMwOTE4NjEyMzc1OTAzOTc2NQ%3D%3D.2"
## [6] "https://https://content_xxx.xxx.com/vp/969b7087cc97408ccee167d473388761/5C55ABEB/t51.2885-15/e35/48675570_402479399847865_689787883_e.jpg?ff_cache_key=MTMwOTE4NjEyMzc1OTAzOTc2NQ%3D%3D.2"
Now, let's try to parse those URLs:
invisible(urltools::url_parse(img_urls))
invisible(httr::parse_url(img_urls))
## Error in httr::parse_url(img_urls): length(url) == 1 is not TRUE
DOH! httr can't do it.
invisible(XML::parseURI(img_urls))
## Error in if (is.na(uri)) return(structure(as.character(uri), class = "URI")): the condition has length > 1
DOH! XML can't do it either.
That means we need to use an sapply() crutch for httr and XML to get the path component (you can run basename() on any resultant vector, as Konrad showed):
data_frame(
urltools = urltools::url_parse(img_urls)$path,
httr = sapply(img_urls, function(URL) httr::parse_url(URL)$path, USE.NAMES = FALSE),
XML = sapply(img_urls, function(URL) XML::parseURI(URL)$path, USE.NAMES = FALSE)
) -> paths
glimpse(paths)
## Observations: 100
## Variables: 3
## $ urltools <chr> "vp/969b7087cc97408ccee167d473388761/5C55ABEB/t51.2885-15/e35/82359289_380972639303339_908467218_h...
## $ httr <chr> "vp/969b7087cc97408ccee167d473388761/5C55ABEB/t51.2885-15/e35/82359289_380972639303339_908467218_h...
## $ XML <chr> "/vp/969b7087cc97408ccee167d473388761/5C55ABEB/t51.2885-15/e35/82359289_380972639303339_908467218_...
Note the not-really-standard inclusion of the initial / in the path from XML. That's not important for this example, but it's important to note the difference in general.
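If that stray leading slash ever matters to you, here's a minimal sketch (using the paths data frame built above) showing that basename() makes it a non-issue, plus how to strip it if you want the raw paths to agree:
# basename() discards the directory part, so the leading "/" doesn't matter there
identical(basename(paths$urltools), basename(paths$XML))  # should be TRUE
# but if you need the raw path strings to line up, strip a leading slash
paths$XML <- sub("^/", "", paths$XML)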
We'll benchmark on a single URL first, since XML and httr have that woeful one-at-a-time limitation:
microbenchmark::microbenchmark(
urltools = urltools::url_parse(img_urls[1])$path,
httr = httr::parse_url(img_urls[1])$path,
XML = XML::parseURI(img_urls[1])$path
)
## Unit: microseconds
## expr min lq mean median uq max neval
## urltools 351.268 397.6040 557.09641 499.2220 618.5945 1309.454 100
## httr 550.298 619.5080 843.26520 717.0705 888.3915 4213.070 100
## XML 11.858 16.9115 27.97848 26.1450 33.9065 109.882 100
XML looks faster, but it's not in practice:
microbenchmark::microbenchmark(
urltools = urltools::url_parse(img_urls)$path,
httr = sapply(img_urls, function(URL) httr::parse_url(URL)$path, USE.NAMES = FALSE),
XML = sapply(img_urls, function(URL) XML::parseURI(URL)$path, USE.NAMES = FALSE)
)
## Unit: microseconds
## expr min lq mean median uq max neval
## urltools 718.887 853.374 1093.404 918.3045 1146.540 2872.076 100
## httr 58513.970 64738.477 80697.548 68908.7635 81549.154 224157.857 100
## XML 1155.370 1245.415 2012.660 1359.8215 1880.372 26184.943 100
If you really want to go the regex route, you can read the RFC for the URL BNF (which includes a naive regex for hacking bits out of one) and Google the seminal example with over a dozen regular expressions that handle not-so-well-formed URIs, but parsing is generally a better strategy for diverse URL content. For your case, splitting and regexing might work just fine, but it isn't necessarily going to be much faster than parsing:
microbenchmark::microbenchmark(
urltools = tools::file_path_sans_ext(basename(urltools::url_parse(img_urls)$path)),
httr = tools::file_path_sans_ext(basename(sapply(img_urls, function(URL) httr::parse_url(URL)$path, USE.NAMES = FALSE))),
XML = tools::file_path_sans_ext(basename(sapply(img_urls, function(URL) XML::parseURI(URL)$path, USE.NAMES = FALSE))),
regex = stri_match_first_regex(img_urls, "/([[:digit:]]{8}_[[:digit:]]{15}_[[:digit:]]{9}_[[:alpha:]]{1})\\.jpg\\?")[,2]
)
## Unit: milliseconds
## expr min lq mean median uq max neval
## urltools 1.140421 1.228988 1.502525 1.286650 1.444522 6.970044 100
## httr 56.563403 65.696242 77.492290 69.809393 80.075763 157.657508 100
## XML 1.513174 1.604012 2.039502 1.702018 1.931468 11.306436 100
## regex 1.137204 1.223683 1.337675 1.260339 1.397273 2.241121 100
As noted in that final example, you'll need to run tools::file_path_sans_ext() on the result to remove the .jpg (or sub() it away).
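To tie it back to your original question (the URLs living in a tbl column), here's a minimal sketch using the urltools approach; the table my_tbl and the column name url are assumptions, so substitute your own:
library(dplyr)
library(urltools)
# hypothetical example table; swap in your own tbl and column name
my_tbl <- tibble(url = img_urls)
my_tbl <- mutate(
  my_tbl,
  img_name = tools::file_path_sans_ext(basename(url_parse(url)$path))
)
head(my_tbl$img_name)
# each img_name has the same shape as 13643048_612108275661958_805860992_n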