
Extract jpg name from a url using R


Could someone help me? Here is my problem: I have a list of URLs in a tbl and I need to extract the JPG name from each one. This is the URL: https://content_xxx.xxx.com/vp/969ffffff61/5C55ABEB/t51.2ff5-15/e35/13643048_612108275661958_805860992_n.jpg?ff_cache_key=fffffQ%3ff%3D.2 and this is the part to extract: 13643048_612108275661958_805860992_n. Thanks for the help.


Solution

  • Googling for "R parse URL" could have saved you from typing ~400 keystrokes (tho I expect the URL was pasted).

    In any event, you want to process a vector of these things, so there's a better way. In fact there are multiple ways to do this URL path extraction in R. Here are 3:

    library(stringi)
    library(urltools)
    library(httr)
    library(XML)
    library(dplyr)
    

    We'll generate 100 unique URLs that fit the same Instagram pattern (NOTE: scraping Instagram is a violation of their ToS and is controlled by robots.txt. If your URLs did not come from the Instagram API, please let me know so I can delete this answer, as I don't help content thieves).

    set.seed(0)
    
    paste(
      "https://content_xxx.xxx.com/vp/969b7087cc97408ccee167d473388761/5C55ABEB/t51.2885-15/e35/",
      stri_rand_strings(100, 8, "[0-9]"), "_",
      stri_rand_strings(100, 15, "[0-9]"), "_",
      stri_rand_strings(100, 9, "[0-9]"), "_",
      stri_rand_strings(100, 1, "[a-z]"),
      ".jpg?ff_cache_key=MTMwOTE4NjEyMzc1OTAzOTc2NQ%3D%3D.2",
      sep=""
    ) -> img_urls
    
    head(img_urls)
    ## [1] "https://content_xxx.xxx.com/vp/969b7087cc97408ccee167d473388761/5C55ABEB/t51.2885-15/e35/82359289_380972639303339_908467218_h.jpg?ff_cache_key=MTMwOTE4NjEyMzc1OTAzOTc2NQ%3D%3D.2"
    ## [2] "https://content_xxx.xxx.com/vp/969b7087cc97408ccee167d473388761/5C55ABEB/t51.2885-15/e35/66021637_359927357880233_471353444_q.jpg?ff_cache_key=MTMwOTE4NjEyMzc1OTAzOTc2NQ%3D%3D.2"
    ## [3] "https://content_xxx.xxx.com/vp/969b7087cc97408ccee167d473388761/5C55ABEB/t51.2885-15/e35/47937926_769874508959124_426288550_z.jpg?ff_cache_key=MTMwOTE4NjEyMzc1OTAzOTc2NQ%3D%3D.2"
    ## [4] "https://content_xxx.xxx.com/vp/969b7087cc97408ccee167d473388761/5C55ABEB/t51.2885-15/e35/12303834_440673970920272_460810703_n.jpg?ff_cache_key=MTMwOTE4NjEyMzc1OTAzOTc2NQ%3D%3D.2"
    ## [5] "https://content_xxx.xxx.com/vp/969b7087cc97408ccee167d473388761/5C55ABEB/t51.2885-15/e35/54186717_202600346704982_713363439_y.jpg?ff_cache_key=MTMwOTE4NjEyMzc1OTAzOTc2NQ%3D%3D.2"
    ## [6] "https://content_xxx.xxx.com/vp/969b7087cc97408ccee167d473388761/5C55ABEB/t51.2885-15/e35/48675570_402479399847865_689787883_e.jpg?ff_cache_key=MTMwOTE4NjEyMzc1OTAzOTc2NQ%3D%3D.2"
    

    Now, let's try to parse those URLs:

    invisible(urltools::url_parse(img_urls))
    
    invisible(httr::parse_url(img_urls))
    ## Error in httr::parse_url(img_urls): length(url) == 1 is not TRUE
    

    DOH! httr can't do it.

    invisible(XML::parseURI(img_urls))
    ## Error in if (is.na(uri)) return(structure(as.character(uri), class = "URI")): the condition has length > 1
    

    DOH! XML can't do it either.

    That means we need to use an sapply() crutch for httr and XML to get the path component (you can run basename() on any resultant vector as Konrad showed):

    tibble(
      urltools = urltools::url_parse(img_urls)$path,
      httr = sapply(img_urls, function(URL) httr::parse_url(URL)$path, USE.NAMES = FALSE),
      XML = sapply(img_urls, function(URL) XML::parseURI(URL)$path, USE.NAMES = FALSE)
    ) -> paths
    
    glimpse(paths)
    ## Observations: 100
    ## Variables: 3
    ## $ urltools <chr> "vp/969b7087cc97408ccee167d473388761/5C55ABEB/t51.2885-15/e35/82359289_380972639303339_908467218_h...
    ## $ httr     <chr> "vp/969b7087cc97408ccee167d473388761/5C55ABEB/t51.2885-15/e35/82359289_380972639303339_908467218_h...
    ## $ XML      <chr> "/vp/969b7087cc97408ccee167d473388761/5C55ABEB/t51.2885-15/e35/82359289_380972639303339_908467218_...
    

    Note the not-really-standard inclusion of the initial / in the path from XML. That's not important for you in this example, but it's important to be aware of the difference in general.
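
    To make the basename() step concrete, here is a minimal base R sketch. The `example_paths` vector is a hypothetical stand-in for the parser output (one path without the leading / and one with it), built from the URL in the question. It suggests the leading slash makes no difference once basename() is applied, and shows both ways of dropping the extension.

    ```r
    # Hypothetical stand-in for parser output: the first path has no leading "/"
    # (as from urltools/httr), the second keeps it (as from XML).
    example_paths <- c(
      "vp/969ffffff61/5C55ABEB/t51.2ff5-15/e35/13643048_612108275661958_805860992_n.jpg",
      "/vp/969ffffff61/5C55ABEB/t51.2ff5-15/e35/13643048_612108275661958_805860992_n.jpg"
    )

    # basename() keeps only the last path component, so the leading "/" is irrelevant
    file_names <- basename(example_paths)

    # Drop the extension with tools::file_path_sans_ext() ...
    jpg_names <- tools::file_path_sans_ext(file_names)

    # ... or with sub(), anchored to the end of the string
    jpg_names_sub <- sub("\\.jpg$", "", file_names)

    print(jpg_names)
    ```

    Both elements should come out as the name requested in the question, and the sub() variant should agree with the tools:: one as long as every file really ends in .jpg.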

    We'll benchmark a single URL first, since XML and httr have that woeful one-URL-at-a-time limitation:

    microbenchmark::microbenchmark(
      urltools = urltools::url_parse(img_urls[1])$path,
      httr = httr::parse_url(img_urls[1])$path,
      XML = XML::parseURI(img_urls[1])$path
    )
    ## Unit: microseconds
    ##      expr     min       lq      mean   median       uq      max neval
    ##  urltools 351.268 397.6040 557.09641 499.2220 618.5945 1309.454   100
    ##      httr 550.298 619.5080 843.26520 717.0705 888.3915 4213.070   100
    ##       XML  11.858  16.9115  27.97848  26.1450  33.9065  109.882   100
    

    XML looks faster on a single URL, but it's not in practice once you process the whole vector:

    microbenchmark::microbenchmark(
      urltools = urltools::url_parse(img_urls)$path,
      httr = sapply(img_urls, function(URL) httr::parse_url(URL)$path, USE.NAMES = FALSE),
      XML = sapply(img_urls, function(URL) XML::parseURI(URL)$path, USE.NAMES = FALSE)
    )
    ## Unit: microseconds
    ##      expr       min        lq      mean     median        uq        max neval
    ##  urltools   718.887   853.374  1093.404   918.3045  1146.540   2872.076   100
    ##      httr 58513.970 64738.477 80697.548 68908.7635 81549.154 224157.857   100
    ##       XML  1155.370  1245.415  2012.660  1359.8215  1880.372  26184.943   100
    

    If you really want to go the regex route, you can read the RFC, which gives the BNF for URLs along with a naive regex for hacking bits out of one, and Google for the seminal example that has over a dozen regular expressions that handle not-so-well-formed URIs. Parsing is generally a better strategy for diverse URL content, though. For your case, splitting and regex'ing might work just fine, but it isn't necessarily going to be that much faster than parsing:

    microbenchmark::microbenchmark(
      urltools = tools::file_path_sans_ext(basename(urltools::url_parse(img_urls)$path)),
      httr = tools::file_path_sans_ext(basename(sapply(img_urls, function(URL) httr::parse_url(URL)$path, USE.NAMES = FALSE))),
      XML = tools::file_path_sans_ext(basename(sapply(img_urls, function(URL) XML::parseURI(URL)$path, USE.NAMES = FALSE))),
      regex = stri_match_first_regex(img_urls, "/([[:digit:]]{8}_[[:digit:]]{15}_[[:digit:]]{9}_[[:alpha:]]{1})\\.jpg\\?")[,2]
    )
    ## Unit: milliseconds
    ##      expr       min        lq      mean    median        uq        max neval
    ##  urltools  1.140421  1.228988  1.502525  1.286650  1.444522   6.970044   100
    ##      httr 56.563403 65.696242 77.492290 69.809393 80.075763 157.657508   100
    ##       XML  1.513174  1.604012  2.039502  1.702018  1.931468  11.306436   100
    ##     regex  1.137204  1.223683  1.337675  1.260339  1.397273   2.241121   100
    

    As shown in that final benchmark, you'll need to run tools::file_path_sans_ext() on the basename() result to remove the .jpg (or sub() it away).
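
    For the regex route, here is a rough base R sketch of the "naive regex" idea. It assumes the RFC in question is RFC 3986, whose Appendix B gives a component-splitting regular expression; that is my assumption, not something the answer states. `url_path()` and `urls_tbl` are hypothetical names made up for this illustration, with `urls_tbl` standing in for the tbl from the question.

    ```r
    # Component-splitting regex from RFC 3986, Appendix B (assumed to be the
    # "naive regex" meant above).
    # Capture groups: 2 = scheme, 4 = authority, 5 = path, 7 = query, 9 = fragment.
    rfc3986_regex <- "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?"

    # Hypothetical helper (not from the answer): vectorised path extraction.
    url_path <- function(urls) {
      matches <- regmatches(urls, regexec(rfc3986_regex, urls))
      # Element 1 is the full match, so capture group 5 (the path) is element 6.
      vapply(matches, function(m) m[6], character(1))
    }

    # Hypothetical stand-in for the asker's tbl, using the URL from the question.
    urls_tbl <- data.frame(
      url = "https://content_xxx.xxx.com/vp/969ffffff61/5C55ABEB/t51.2ff5-15/e35/13643048_612108275661958_805860992_n.jpg?ff_cache_key=fffffQ%3ff%3D.2",
      stringsAsFactors = FALSE
    )

    # Path -> file name -> file name without extension
    urls_tbl$jpg_name <- tools::file_path_sans_ext(basename(url_path(urls_tbl$url)))
    print(urls_tbl$jpg_name)
    ```

    Like XML::parseURI(), this regex keeps the leading / in the path, which basename() discards anyway. It does no validation, so for messy real-world URLs a proper parser such as urltools::url_parse() is still the safer choice.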