I need to scrape an https link from two kinds of HTML.
The first looks like this:
<a href="javascript:void(0)" onclick="javascript:newwindow1('https://hello.com/uploads/order/8c25ce592gfgfgfh99.pdf');">
this is some content Lorem Ipsum Lorem Ipsum Lorem Ipsum <img src="/img/pdf.jpg" width="15"></a>
The second looks like this:
<a href="javascript:void(0)" onclick="javascript:newwindow1('https://hello.com//webadmin/pdf/order/2018/Aug/hello this is regarding an older document Ors._2018-08-31 12:09:12.pdf');">
this is some content Lorem Ipsum Lorem Ipsum Lorem Ipsum <img src="/img/pdf.jpg" width="15"></a>
The two differ only in the link passed to newwindow1: in the second snippet the link contains a few spaces, and the string pdf appears in it twice.
I want to extract the link from both of them. I am using C#:
Regex.Match(HtmlString, @"('https[^\s]+.pdf')");
This extracts the link from the first snippet, but for the second one it extracts this:
https://hello.com//webadmin/pdf/
The match starts at https and stops at pdf, but the link has not finished yet.
Apart from regex, please let me know if this can be done with HTML Agility Pack.
With HtmlAgilityPack, you may parse HTML DOM documents, but you cannot parse JavaScript code with it.
You may only use regex if you know the code is always formatted the way it is shown in the question, i.e. if the value you need to extract is always inside single quotes. Then, you may use the [^'] negated character class, which matches any char but a single quote, instead of [^\s], which matches any char but whitespace chars.
var url = Regex.Match(HtmlString, @"'https[^']+\.pdf'").Value;
Or, to just get the URL without single quotes:
var url = Regex.Match(HtmlString, @"'(https[^']+\.pdf)'").Groups[1].Value;
Note that you should escape the dot outside a character class in the pattern to match a literal dot.
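If a page contains several such links, the same pattern can be applied with Regex.Matches. The following is only a minimal sketch under the assumption stated above, namely that every URL sits inside single quotes within the onclick value; the class name PdfLinkExtractor and the method name ExtractPdfUrls are made up for illustration and are not part of any library.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PdfLinkExtractor
{
    // [^']+ consumes everything up to the closing single quote, so spaces
    // inside the URL are allowed; \. matches a literal dot before "pdf".
    private static readonly Regex PdfUrl = new Regex(@"'(https[^']+\.pdf)'");

    // Returns every single-quoted https...pdf URL found in the HTML text,
    // without the surrounding quotes (capture group 1).
    public static List<string> ExtractPdfUrls(string html)
    {
        var urls = new List<string>();
        foreach (Match m in PdfUrl.Matches(html))
        {
            urls.Add(m.Groups[1].Value);
        }
        return urls;
    }
}
```

Calling PdfLinkExtractor.ExtractPdfUrls(HtmlString) on a string that holds both snippets from the question should return the two URLs in document order, including the one that contains spaces. If the site ever switches to double quotes or builds the URL in script, this pattern will silently find nothing, so it is worth checking for an empty result.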