Could someone help me? This is my problem: I have a list of URLs in a table and I have to extract the JPG name. This is the URL and this is the part to extract: 13643048_612108275661958_805860992_n. Thanks for your help.
Googling for "R parse URL" could have saved you from typing ~400 keystrokes (tho I expect the URL was pasted).
In any event, you want to process a vector of these things, so there's a better way. In fact there are multiple ways to do this URL path extraction in R. Here are 3:
We'll generate 100 unique URLs that fit the same Instagram pattern (NOTE: scraping Instagram is a violation of their ToS & controlled by robots.txt. If your URLs did not come from the Instagram API, please let me know so I can delete this answer, as I don't help content thieves).
stri_rand_strings(100, 8, "[0-9]"), "_",
stri_rand_strings(100, 15, "[0-9]"), "_",
stri_rand_strings(100, 9, "[0-9]"), "_",
stri_rand_strings(100, 1, "[a-z]"),
) -> img_urls
## [1] ""
## [2] "https://"
## [3] "https://"
## [4] "https://"
## [5] "https://"
## [6] "https://"
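For anyone who wants to reproduce a similar test vector, here is a minimal sketch of how such URLs might be assembled. The host `cdn.example.com`, the path prefix `some/path` and the query string are placeholders (assumptions, not the values used above); only the digit/letter pattern of the file name follows the original.

```r
library(stringi)

set.seed(2018)  # arbitrary seed, only for reproducibility

n <- 100

# File names follow the 8-15-9 digit pattern plus one trailing letter
file_names <- paste0(
  stri_rand_strings(n, 8, "[0-9]"), "_",
  stri_rand_strings(n, 15, "[0-9]"), "_",
  stri_rand_strings(n, 9, "[0-9]"), "_",
  stri_rand_strings(n, 1, "[a-z]")
)

# Hypothetical host, path prefix and query string
img_urls <- sprintf(
  "https://cdn.example.com/some/path/%s.jpg?key=value",
  file_names
)

head(img_urls, 3)
```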
Now, let's try to parse those URLs:
httr::parse_url(img_urls)
## Error in httr::parse_url(img_urls): length(url) == 1 is not TRUE
DOH! httr can't do it.
XML::parseURI(img_urls)
## Error in if (is.na(uri)) return(structure(as.character(uri), class = "URI")): the condition has length > 1
XML can't do it either.
That means we need to use an sapply() crutch for httr and XML to get the path component (you can run basename() on any resultant vector as Konrad showed):
tibble::tibble(
  urltools = urltools::url_parse(img_urls)$path,
  httr = sapply(img_urls, function(URL) httr::parse_url(URL)$path, USE.NAMES = FALSE),
  XML = sapply(img_urls, function(URL) XML::parseURI(URL)$path, USE.NAMES = FALSE)
) -> paths
dplyr::glimpse(paths)
## Observations: 100
## Variables: 3
## $ urltools <chr> "vp/969b7087cc97408ccee167d473388761/5C55ABEB/t51.2885-15/e35/82359289_380972639303339_908467218_h...
## $ httr <chr> "vp/969b7087cc97408ccee167d473388761/5C55ABEB/t51.2885-15/e35/82359289_380972639303339_908467218_h...
## $ XML <chr> "/vp/969b7087cc97408ccee167d473388761/5C55ABEB/t51.2885-15/e35/82359289_380972639303339_908467218_...
Note the not really standard inclusion of the initial / in the path from XML. That's not important for you for this example, but it's important to note the difference in general.
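If you mix parsers, it may be worth normalising that leading slash before comparing results. A minimal base R sketch, using made-up path values (the directory `some/dir` is hypothetical, and `strip_leading_slash()` is just an illustrative helper):

```r
# Two spellings of the same path, as different parsers might return it
path_no_slash   <- "some/dir/13643048_612108275661958_805860992_n.jpg"
path_with_slash <- "/some/dir/13643048_612108275661958_805860992_n.jpg"

# Illustrative helper: strip a single leading "/" so both forms compare equal
strip_leading_slash <- function(x) sub("^/", "", x)

identical(strip_leading_slash(path_with_slash), path_no_slash)

# basename() only keeps the last path component, so the slash does not matter
basename(c(path_no_slash, path_with_slash))
```

Since the goal here is only the file name, the difference disappears once basename() is applied.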
We'll process one of them since XML and httr have that woeful limitation:
microbenchmark::microbenchmark(
  urltools = urltools::url_parse(img_urls[1])$path,
  httr = httr::parse_url(img_urls[1])$path,
  XML = XML::parseURI(img_urls[1])$path
)
## Unit: microseconds
## expr min lq mean median uq max neval
## urltools 351.268 397.6040 557.09641 499.2220 618.5945 1309.454 100
## httr 550.298 619.5080 843.26520 717.0705 888.3915 4213.070 100
## XML 11.858 16.9115 27.97848 26.1450 33.9065 109.882 100
XML looks faster, but it's not in practice:
microbenchmark::microbenchmark(
  urltools = urltools::url_parse(img_urls)$path,
  httr = sapply(img_urls, function(URL) httr::parse_url(URL)$path, USE.NAMES = FALSE),
  XML = sapply(img_urls, function(URL) XML::parseURI(URL)$path, USE.NAMES = FALSE)
)
## Unit: microseconds
## expr min lq mean median uq max neval
## urltools 718.887 853.374 1093.404 918.3045 1146.540 2872.076 100
## httr 58513.970 64738.477 80697.548 68908.7635 81549.154 224157.857 100
## XML 1155.370 1245.415 2012.660 1359.8215 1880.372 26184.943 100
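The gap presumably comes from calling a scalar-only parser once per URL instead of handing the whole vector to a vectorised one. As an aside, vapply() is a slightly stricter way to write that loop than sapply(), since it checks that every call returns exactly one string. The sketch below uses a hypothetical stand-in, `scalar_path()`, that mimics the one-URL-at-a-time limitation with base R only; it is not how httr or XML actually parse URLs, and the example URLs are made up.

```r
# Hypothetical stand-in for a parser that only accepts a single URL
scalar_path <- function(url) {
  stopifnot(length(url) == 1)
  # Drop the query string and fragment, if any
  no_query <- sub("[?#].*$", "", url)
  # Drop the scheme, the host and the slash that follows the host
  sub("^[a-zA-Z][a-zA-Z0-9+.-]*://[^/]*/?", "", no_query)
}

# Made-up URLs for illustration
urls <- c(
  "https://cdn.example.com/a/b/one.jpg?x=1",
  "https://cdn.example.com/c/two.jpg"
)

# vapply() loops like sapply() but enforces a character(1) result per element
paths <- vapply(urls, scalar_path, character(1), USE.NAMES = FALSE)
paths
```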
If you really want to go the regex route, you can read the RFC for the URL BNF and a naive regex for hacking bits out of one, and Google for the seminal example that has over a dozen regular expressions that handle not-so-well-formed URIs. Parsing is generally a better strategy for diverse URL content, though. For your case, splitting and regex'ing might work just fine, but it isn't necessarily going to be that much faster than parsing:
microbenchmark::microbenchmark(
  urltools = tools::file_path_sans_ext(basename(urltools::url_parse(img_urls)$path)),
  httr = tools::file_path_sans_ext(basename(sapply(img_urls, function(URL) httr::parse_url(URL)$path, USE.NAMES = FALSE))),
  XML = tools::file_path_sans_ext(basename(sapply(img_urls, function(URL) XML::parseURI(URL)$path, USE.NAMES = FALSE))),
  regex = stri_match_first_regex(img_urls, "/([[:digit:]]{8}_[[:digit:]]{15}_[[:digit:]]{9}_[[:alpha:]]{1})\\.jpg\\?")[,2]
)
## Unit: milliseconds
## expr min lq mean median uq max neval
## urltools 1.140421 1.228988 1.502525 1.286650 1.444522 6.970044 100
## httr 56.563403 65.696242 77.492290 69.809393 80.075763 157.657508 100
## XML 1.513174 1.604012 2.039502 1.702018 1.931468 11.306436 100
## regex 1.137204 1.223683 1.337675 1.260339 1.397273 2.241121 100
As noted in that final example, you'll need to run tools::file_path_sans_ext() on the result to remove the .jpg (or sub() it away).
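To make those two options concrete, here is a minimal base R sketch that starts from paths such as a URL parser would return. The directory parts and the second file name are made up for illustration; the first file name is the one from the question.

```r
# Paths as a URL parser might return them (directories are hypothetical)
paths <- c(
  "some/dir/13643048_612108275661958_805860992_n.jpg",
  "/other/dir/00000000_000000000000000_000000000_a.jpg"
)

# Option 1: basename() + tools::file_path_sans_ext()
via_tools <- tools::file_path_sans_ext(basename(paths))

# Option 2: basename() + sub() to drop a trailing ".jpg"
via_sub <- sub("\\.jpg$", "", basename(paths))

# Both options should agree for plain ".jpg" file names
identical(via_tools, via_sub)
via_tools
```

The sub() variant only removes a literal `.jpg` suffix, whereas tools::file_path_sans_ext() drops whatever extension is present, so pick whichever matches your data.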