I have hundreds of TXT files which contain many things and some download links.
The download links follow this pattern:
start with: http://
and
end with: .nc
I created a sample text file for your convenience that you could download from this link:
https://www.dropbox.com/s/5crmleli2ppa1rm/textfile_including_https.txt?dl=1
Based on this Stack Overflow topic, I tried to extract all download links from the text file:
Extract websites links from a text in R
Here is my code:
download_links <- readLines(file.choose())
All_my_links <- gsub(download_links, pattern=".*(http://.*nc).*", replace="\\1")
But it also returns all the other lines, whereas I only want to extract the http links ending with .nc
Here is the result:
head(All_my_links )
tail(All_my_links )
> head(All_my_links )
[1] "#!/bin/bash"
[2] "##############################################################################"
[3] "version=1.3.2"
[4] "CACHE_FILE=.$(basename $0).status"
[5] "openId="
[6] "search_url='https://esgf-node.llnl.gov/esg-search/wget/?distrib=false&dataset_id=CMIP6.HighResMIP.MIROC.NICAM16-9S.highresSST-present.r1i1p1f1.day.pr.gr.v20190830|esgf-data2.diasjp.net'"
> tail(All_my_links )
[1] "MYPROXY_STATUS=$HOME/.MyProxyLogon"
[2] "COOKIE_JAR=$ESG_HOME/cookies"
[3] "MYPROXY_GETCERT=$ESG_HOME/getcert.jar"
[4] "CERT_EXPIRATION_WARNING=$((60 * 60 * 8)) #Eight hour (in seconds)"
[5] ""
[6] "WGET_TRUSTED_CERTIFICATES=$ESG_HOME/certificates"
What is my mistake in the code?
Any comment would be highly appreciated.
gsub() is not for extracting, that's what's wrong with your code. It's for replacing. (See help("gsub").) For the purposes of this demonstration, I will use the following data:
x <- c("abc", "123", "http://site.nc")
(I will not, as a rule, download data posted here as a link. Most others won't either. If you want to share example data, it's best to do so by including in your question the output from dput().)
Let's see what happens with your gsub() approach:
gsub(pattern = ".*(http://.*nc).*", replacement = "\\1", x = x)
# [1] "abc" "123" "http://site.nc"
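Looks familiar. What's going on here is that gsub() looks at each element of x and replaces each match of pattern with replacement. Because the pattern is wrapped in .* on both sides, an element that matches is replaced as a whole by the captured group "\\1", which is the link; for "http://site.nc" that is the element itself. Elements that do not match are returned unchanged, so this approach always gives back a vector of the same length as x, with every non-link line still in it.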
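A slightly larger sketch may make this easier to see. The vector y below is hypothetical (it is not taken from the question's file) and puts the link inside a longer line, so the effect of the replacement is visible: the matching line is reduced to its link, while the other lines pass through untouched.

```r
# Hypothetical example data: the link is embedded in a longer line
y <- c("abc", "see http://site.nc here", "123")

# Matching elements are replaced by the captured group;
# non-matching elements are returned as they are
res <- gsub(pattern = ".*(http://.*nc).*", replacement = "\\1", x = y)

# Keeping only the elements that matched is one way to drop the other lines
only_links <- res[grepl("http://.*nc", y)]
```

Here res still has three elements, and only_links keeps just the one that contained a link. This works, but it relies on gsub() for something it was not designed to do.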
I would suggest stringr::str_extract(). Note that the pattern should not be wrapped in .* here, otherwise the whole line is extracted rather than just the link:
stringr::str_extract(string = x, pattern = "http://.*\\.nc")
# [1] NA NA "http://site.nc"
If you wrap this in na.omit(), it gives you the output I think you want:
na.omit(stringr::str_extract(string = x, pattern = "http://.*\\.nc"))
# [1] "http://site.nc"
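If you would rather avoid an extra package, or if a single line can hold more than one link, a base R sketch along the following lines might also work. The vector txt is hypothetical and stands in for the output of readLines(); the pattern assumes that a link contains no whitespace or quote characters, which may not hold for every file.

```r
# Hypothetical lines standing in for the output of readLines()
txt <- c(
  "#!/bin/bash",
  "wget 'http://a.example/f1.nc' 'http://b.example/f2.nc'",
  "search_url='https://example.org/search'"
)

# Assumed link shape: "http://", then any characters other than
# whitespace or quotes, ending in ".nc"
link_pattern <- "http://[^[:space:]'\"]+\\.nc"

# gregexpr() locates every match in each line;
# regmatches() pulls out the matched text; unlist() flattens the result
links <- unlist(regmatches(txt, gregexpr(link_pattern, txt)))
```

Lines without a match contribute nothing, so no na.omit() step is needed, and both links on the second line are returned. With stringr, str_extract_all() plays a similar role when several links per line are possible.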