php regex web-scraping html-parsing text-extraction

Scrape href values from qualifying hyperlinks in Amazon's search results HTML

I've been trying to build a simple scraper that would take a keyword, then go to Amazon and enter the keyword into the search box, then scrape the main results only.

The problem is that the Regex isn't working. I've tried many different ways, but it's still not working properly.

$url = "http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=dog+bed&x=0&y=0";

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$return = curl_exec($ch);
curl_close($ch);

preg_match_all('(<div.*class="data">.*<div class="title">.*<a.*class="title".*href="(.*?)">(.*?)</a>)', $return, $matches);

var_dump($matches);

Now Amazon's HTML code looks like this:

<div class="title">
    <a class="title" href="http://www.amazon.com/Midwest-40236--23-Inch-Quiet-Time/dp/B00063KG7S/ref=sr_1_1?ie=UTF8&amp;qid=1307126379&amp;sr=8-1">Midwest 40236 36-By-23-Inch Quiet Time Bolster Pet Bed, Fleece</a>
    <span class="ptBrand">by Midwest Homes for Pets</span>
    <span class="bindingAndRelease">(Nov 30, 2006)</span>
</div>

I've tried changing the Regex a million different ways, but what I've learned over the past few months just isn't working at all. Of course, if I reduce it to href="(.*?)", I get every link on the page, but I get nothing once I add the <a class="title" part.

Solution

Parsing complex structures with a regex often fails. The regex gets complicated, and even if you put a lot of effort in, it never works properly. That comes down to the nature of the data you want to analyse and the limitations of regexes.

When websites weren't that complex, I did the following, which often works well as a quick solution:

Find a string that marks the beginning of the interesting part and cut out everything before it. Then find a string that marks the end and cut out everything after it.

Then parse what is left :)
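The cut-then-parse idea can be sketched with plain string functions. This is only an illustration: `cutBetween()` is a hypothetical helper, not part of PHP, and the sample markup is the snippet from the question with the URL shortened.

```php
<?php
// Hypothetical helper: return the text between two marker strings,
// or null if either marker cannot be found.
function cutBetween(string $haystack, string $startMarker, string $endMarker): ?string
{
    $start = strpos($haystack, $startMarker);
    if ($start === false) {
        return null;
    }
    $start += strlen($startMarker);          // skip past the start marker
    $end = strpos($haystack, $endMarker, $start);
    if ($end === false) {
        return null;
    }
    return substr($haystack, $start, $end - $start);
}

// Sample input modelled on the question's HTML (URL shortened for readability).
$html = '<div class="title">
    <a class="title" href="http://www.amazon.com/Midwest-40236--23-Inch-Quiet-Time/dp/B00063KG7S/">Midwest 40236 36-By-23-Inch Quiet Time Bolster Pet Bed, Fleece</a>
    <span class="ptBrand">by Midwest Homes for Pets</span>
</div>';

// Step 1: cut away everything before and after the interesting part.
$anchor = cutBetween($html, '<a class="title"', '</a>') ?? '';

// Step 2: parse the small remaining piece.
$href = cutBetween($anchor, 'href="', '"');
$text = substr($anchor, strpos($anchor, '>') + 1);

var_dump($href, $text);
```

This only holds up while the markers stay unique and stable in the page, which is why it suits quick jobs rather than long-lived scrapers.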

Nowadays, if you need something flexible, write yourself a cache layer that automatically keeps a copy of the resources you need to scrape. That way you can work out the right scraping strategy without the overhead of requesting the external data over and over again (the pages do not change that fast).
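A cache layer like that can be very small. The following is a minimal sketch, assuming a file-based cache keyed by a hash of the URL; `cachedFetch()`, the cache directory layout and the one-day default lifetime are all illustrative choices, not an established API.

```php
<?php
// Hypothetical helper: fetch a URL through a simple file cache.
// $fetcher is any callable that takes a URL and returns the body as a
// string (or false on failure), so the transport can be swapped freely.
function cachedFetch(string $url, string $cacheDir, callable $fetcher, int $maxAge = 86400): string
{
    if (!is_dir($cacheDir) && !mkdir($cacheDir, 0777, true) && !is_dir($cacheDir)) {
        throw new RuntimeException("Cannot create cache directory: $cacheDir");
    }
    $file = $cacheDir . '/' . sha1($url) . '.html';

    // Serve the stored copy while it is younger than $maxAge seconds.
    if (is_file($file) && time() - filemtime($file) < $maxAge) {
        return file_get_contents($file);
    }

    // Otherwise request the resource once and store it for later runs.
    $body = $fetcher($url);
    if (!is_string($body)) {
        throw new RuntimeException("Fetching failed: $url");
    }
    file_put_contents($file, $body);
    return $body;
}

// A cURL-based fetcher using the same options as the question's code.
$curlFetcher = function (string $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    return curl_exec($ch);
};
```

With this in place, `$html = cachedFetch($url, __DIR__ . '/cache', $curlFetcher);` hits the network only on the first run; every later run while you tweak the parsing code reads the local copy.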

Then convert the HTML into XML, for example with DOMDocument in PHP. That works very well once you've done it two or three times. You might run into encoding and syntax problems, but those can be solved, and things have become much better compared to some years ago.

Then you can move on to XPath, which is quite flexible for running expressions on the XML.
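As a rough sketch of the DOMDocument plus XPath route, here is the snippet from the question loaded and queried for its title link. The XPath expression assumes the markup looks exactly like that snippet; the real results page may need a different or looser expression.

```php
<?php
// Sample input: the HTML snippet from the question.
$html = '<div class="title">
    <a class="title" href="http://www.amazon.com/Midwest-40236--23-Inch-Quiet-Time/dp/B00063KG7S/ref=sr_1_1?ie=UTF8&amp;qid=1307126379&amp;sr=8-1">Midwest 40236 36-By-23-Inch Quiet Time Bolster Pet Bed, Fleece</a>
    <span class="ptBrand">by Midwest Homes for Pets</span>
    <span class="bindingAndRelease">(Nov 30, 2006)</span>
</div>';

// Real-world HTML is rarely valid, so collect parser warnings
// internally instead of printing them.
$previous = libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Select only <a class="title"> elements directly inside <div class="title">.
$xpath = new DOMXPath($doc);
$links = [];
foreach ($xpath->query('//div[@class="title"]/a[@class="title"]') as $a) {
    $links[] = [
        'href' => $a->getAttribute('href'),   // entities such as &amp; come back decoded
        'text' => trim($a->textContent),
    ];
}

print_r($links);
```

Unlike the regex, the query does not care about line breaks, attribute order inside the tag, or extra whitespace, because it works on the parsed tree rather than on the raw text.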

Beyond that, there is a PHP library that really is super cool: FluentDOM.

It combines the best of DOMDocument, XPath and PHP and is quite flexible.

Some examples & resources by the author of FluentDOM I can suggest: