Tags: javascript, regex, web-scraping, wildcard

How to get data from a string using a JavaScript regex


I can't post the exact data I'm trying to extract, but here's a basic scenario with the same outcome. I'm grabbing the body of a page and trying to extract a bit.ly link from it. For example, this is the chunk of data I'm trying to grab the link from.

String:

<a href="/l.php?u=http%3A%2F%2Fbit.ly%2FPq8AkS&amp;h=aAQFZxdL0&amp;s=1" target="_blank"    rel="nofollow nofollow" onmouseover="LinkshimAsyncLink.swap(this, &quot;http:\\/\\/bit.ly\\/Pq8AkS&quot;);" onclick="LinkshimAsyncLink.referrer_log(this, &quot;http:\\/\\/bit.ly\\/Pq8AkS&quot;, &quot;http:\\/\\/www.facebook.com\\/si\\/ajax\\/l\\/render_linkshim_log\\/?u=http\\u00253A\\u00252F\\u00252Fbit.ly\\u00252FPq8AkS&amp;h=aAQFZxdL0&amp;s=1&quot;);">http://bit.ly/Pq8AkS</a></div><div class="shareUnit"><div class="-cx-PRIVATE-fbTimelineExternalShareUnit__wrapper"><div><div class="-cx-PRIVATE-fbTimelineExternalShareUnit__root -cx-PRIVATE-fbTimelineExternalShareUnit__hasImage"><a class="-cx-PRIVATE-fbTimelineExternalShareUnit__video -cx-PRIVATE-fbTimelineExternalShareUnit__image -cx-PRIVATE-fbTimelineExternalShareUnit__content" ajaxify="/ajax/flash/expand_inline.php?target_div=uikk85_59&amp;share_id=271663136271285&amp;max_width=403&amp;max_height=403&amp;context=timelineSingle" rel="async" href="#" onclick="CSS.addClass(this, &quot;-cx-PRIVATE-fbTimelineExternalShareUnit__loading&quot;);CSS.removeClass(this, &quot;-cx-PRIVATE-fbTimelineExternalShareUnit__video&quot;);"><i class="-cx-PRIVATE-fbTimelineExternalShareUnit__play"></i><img class="img" src="http://external.ak.fbcdn.net/safe_image.php?d=AQDoyY7_wjAyUtX2&amp;w=155&amp;h=114&amp;url=http%3A%2F%2Fi1.ytimg.com%2Fvi%2FDre21lBu2zU%2Fmqdefault.jpg" alt="" /></a>

Now, I can get what I'm looking for with the following code, but the link isn't always going to be exactly 6 characters long, so this causes an issue:

Body = document.getElementsByTagName("body")[0].innerHTML;
regex = /2Fbit.ly%2F(.{6})&amp;h/g;
Matches = regex.exec(Body);

Here's what I was originally trying, but the problem is that it grabs too much data. It goes all the way to the last "&amp;h" in the string above instead of stopping at the first one it hits.

Body = document.getElementsByTagName("body")[0].innerHTML;
regex = /2Fbit.ly%2F(.*)&amp;h/g;
Matches = regex.exec(Body);

So basically, the part of the string I'm trying to focus on is "%2Fbit.ly%2FPq8AkS&amp;h", so that I can get the "Pq8AkS" out of it. When I use (.*), it grabs everything between "%2F" and the very last "&amp;h" in the large string above.


Solution

  • You should not be using a regex on HTML. Use DOM functions to get the desired link object, then get the href attribute from that, then you can use a regex on just the href.

    By default, .* is greedy: it matches as much as it can while still allowing the overall pattern to match. If you want it to be non-greedy (match as little as possible), use .*? instead, like this:

    regex = /2Fbit.ly%2F(.*?)&amp;h/;
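
To make the greedy versus non-greedy difference concrete, here is a minimal sketch run against a shortened stand-in for the page source. The `sample` string is an assumption: it is trimmed from the string in the question so that it still contains more than one "&amp;h". The dot in `bit\.ly` is also escaped here so that it only matches a literal period.

```javascript
// Shortened stand-in for the page source (an assumption, trimmed from the
// question's string). It contains "&amp;h" twice.
const sample =
  'href="/l.php?u=http%3A%2F%2Fbit.ly%2FPq8AkS&amp;h=aAQFZxdL0&amp;s=1" ' +
  'src="safe_image.php?d=AQDoyY7&amp;w=155&amp;h=114"';

// Greedy: (.*) extends to the LAST "&amp;h" in the string.
const greedy = /2Fbit\.ly%2F(.*)&amp;h/.exec(sample);

// Non-greedy: (.*?) stops at the FIRST "&amp;h" it can reach.
const lazy = /2Fbit\.ly%2F(.*?)&amp;h/.exec(sample);

console.log(greedy[1]); // the short code plus everything up to the last "&amp;h"
console.log(lazy[1]);   // "Pq8AkS"
```

Because the non-greedy group grows one character at a time, it also works when the short code is not exactly 6 characters long.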
    

    I also don't think you want the g flag on the regex as there should only be one match in the right URL.
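
One practical reason to drop the g flag, sketched below: with g, exec() remembers where the previous match ended (in the regex's lastIndex property) and resumes from there, so calling it twice on the same string can return null the second time. The `href` value is copied from the question; the variable names are only illustrative.

```javascript
// href value as it appears in the raw page source from the question.
const href = "/l.php?u=http%3A%2F%2Fbit.ly%2FPq8AkS&amp;h=aAQFZxdL0&amp;s=1";

// With the g flag, the regex object is stateful between exec() calls.
const withG = /2Fbit\.ly%2F(.*?)&amp;h/g;
const firstWithG = withG.exec(href);  // finds the match, advances lastIndex
const secondWithG = withG.exec(href); // null: the search resumed after the match

// Without the g flag, every exec() call starts from the beginning.
const withoutG = /2Fbit\.ly%2F(.*?)&amp;h/;
const firstWithoutG = withoutG.exec(href);
const secondWithoutG = withoutG.exec(href); // same match again

console.log(firstWithG[1], secondWithG);          // "Pq8AkS" null
console.log(firstWithoutG[1], secondWithoutG[1]); // "Pq8AkS" "Pq8AkS"
```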

    If you show the rest of your HTML, we could offer advice on finding the right link object rather than trying to match the entire body HTML.
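
Since the rest of the page isn't shown, the following is only a rough sketch of the DOM-first approach. The function names `bitlyCodeFromHref` and `findBitlyCodes` and the selector `a[href*="bit.ly"]` are assumptions for illustration, not something taken from the page. Note that getAttribute("href") returns the attribute value with entities already decoded, so the separator is a plain & rather than &amp;.

```javascript
// Pull the short code out of a single href value (a plain string).
// Returns null when the href does not contain an encoded bit.ly link.
function bitlyCodeFromHref(href) {
  const match = /bit\.ly%2F([^&]*)/.exec(href);
  return match ? match[1] : null;
}

// Hypothetical helper: let the DOM find candidate links, then run the
// regex on each href only, never on the whole body HTML.
function findBitlyCodes(root) {
  const links = root.querySelectorAll('a[href*="bit.ly"]');
  return Array.from(links, (a) => bitlyCodeFromHref(a.getAttribute("href")))
    .filter((code) => code !== null);
}
```

In a browser this could be called as `findBitlyCodes(document)`. A narrower selector would be better once the surrounding markup is known.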


    FYI, another trick for a non-greedy match is to do something like this:

    regex = /2Fbit.ly%2F([^&]*)&amp;h/;
    

    This matches a series of characters that are not &, followed by &amp;h, which accomplishes the same goal as long as & can't appear in the matched sequence.
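
A short sketch of the negated character class on codes of different lengths. The first sample is taken from the question. The second, with the shorter code "abc", is made up for illustration.

```javascript
// [^&]* cannot cross an "&", so the capture ends at the first "&amp;h".
const pattern = /2Fbit\.ly%2F([^&]*)&amp;h/;

const samples = [
  "u=http%3A%2F%2Fbit.ly%2FPq8AkS&amp;h=aAQFZxdL0&amp;s=1",  // from the question
  "u=http%3A%2F%2Fbit.ly%2Fabc&amp;h=xyz&amp;w=155&amp;h=114", // hypothetical shorter code
];

const codes = samples.map((s) => {
  const m = pattern.exec(s);
  return m ? m[1] : null;
});

console.log(codes); // "Pq8AkS" and "abc"
```

The character class never has to back off from a later "&amp;h", which is why it stops at the first one without needing the non-greedy quantifier.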