Tags: php, screen-scraping

scraping a page


What would be the best practice for scraping a horrible mess of a distributor's inventory page? It uses JS to document.write an opening <td>, then closes it in plain HTML. None of the divs, tds, or anything else is labelled with an id or class.

Should I just straight up preg_match(_all) the thing, or is there some XPath magic I can do? There is no API, no feeds, no XML, nothing clean at all.

edit:

- What I'm basically thinking of at the moment is something like http://pastebin.com/raw.php?i=EuMfRVD5. Is that my best bet, or is there another way?


Solution

  • Your example is too small to judge from. But since you seemingly don't need the highlighting meta info anyway, the JS obfuscation could be undone with a bit of:

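    // Unwrap document.write("...") scripts into the markup they emit; drop all other scripts.
    $html = preg_replace('# <script\b[^>]*> (?: \s* document\.write\( "(.*?)" \) \s* ;? \s* | .*? ) </script> #six', "$1", $html);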
    

    That may already be good enough to pipe the result through one of the DOM libraries afterwards.
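
As a rough sketch of that pipeline, the following unwraps the document.write calls, hands the result to PHP's bundled DOMDocument, and picks cells by position with DOMXPath, since nothing has an id or class. The sample markup, the two-column layout (name, stock), and the $rows structure are assumptions made up for illustration; the real page will need its own XPath expressions. The regex also assumes each script holds a single document.write with a plain double-quoted string.

```php
<?php
// Hypothetical sample of the markup described in the question:
// the opening <td> comes from document.write, the closing tag is plain HTML.
$html = <<<'HTML'
<table>
<tr><script type="text/javascript">document.write("<td>");</script>Widget A</td><td>12</td></tr>
<tr><script type="text/javascript">document.write("<td>");</script>Widget B</td><td>0</td></tr>
</table>
<script>var tracking = 1;</script>
HTML;

// Step 1: replace each document.write("...") script with the string it writes.
$html = preg_replace('#<script\b[^>]*>\s*document\.write\("(.*?)"\)\s*;?\s*</script>#si', '$1', $html);
// Step 2: remove whatever scripts are left.
$html = preg_replace('#<script\b[^>]*>.*?</script>#si', '', $html);

// Step 3: let libxml repair the markup; collect its warnings instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Step 4: with no ids or classes to hook into, address cells by position.
$xpath = new DOMXPath($doc);
$rows = [];
foreach ($xpath->query('//table//tr[td]') as $tr) {
    $cells = $xpath->query('td', $tr);
    $rows[] = [
        'name'  => trim($cells->item(0)->textContent),
        'stock' => (int) trim($cells->item(1)->textContent),
    ];
}

print_r($rows);
```

For the sample above this yields two rows, Widget A with stock 12 and Widget B with stock 0. Doing the unwrapping in two passes keeps each regex simple, and positional XPath such as `td[2]` tends to be easier to adjust than one large preg_match_all pattern when the distributor shuffles columns.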