What would be the best practice for scraping a horrible mess of a distributor's inventory page (it uses JS to document.write a <td>, then plain-text HTML to close it)? None of the divs, tds, or anything else is labelled with an id or class.
Should I just straight up preg_match(?_all) the thing, or is there some XPath magic I can do? There is no API, no feeds, no XML, nothing clean at all.
edit:
- What I'm basically thinking of at the moment is something like http://pastebin.com/raw.php?i=EuMfRVD5 - is that my best bet, or is there any other way?
Your example is not enough of an example. But since you seemingly don't need the highlighting meta info anyway, the JS-obfuscation could be undone with a bit of:
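// The lazy prefix sits inside the optional group; placed before it, the group gets skipped and $1 stays empty.
$html = preg_replace('# <script\b (?: (?:(?!</script>).)*? document\.write\("(.*?)"\) )? .*? </script> #six', '$1', $html);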
Maybe that's already good enough to pipe it through one of the DOM libraries afterwards.
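To make the "regex first, DOM second" idea concrete, here is a minimal sketch. The sample markup, the column order (name, then stock), and the variable names are all assumptions, since the real page is not shown; the pattern is a variant of the one above that keeps the search for document.write inside a single script block.

```php
<?php
// Hypothetical sample markup: the real page's structure is not shown in the question.
$html = <<<'HTML'
<table>
<tr><script type="text/javascript">document.write("<td>");</script>Widget A</td><td>12</td></tr>
<tr><script type="text/javascript">document.write("<td>");</script>Widget B</td><td>0</td></tr>
</table>
<script>var tracking = 1;</script>
HTML;

// Step 1: replace each <script> block with the string it would have written
// (scripts without a double-quoted document.write call are simply dropped).
$pattern = '# <script\b (?: (?:(?!</script>).)*? document\.write\("(.*?)"\) )? .*? </script> #six';
$clean = preg_replace($pattern, '$1', $html);

// Step 2: load the repaired markup; libxml warnings are collected instead of printed.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($clean);
libxml_clear_errors();

// Step 3: there are no ids or classes, so address the cells by position.
$xpath = new DOMXPath($doc);
$rows = [];
foreach ($xpath->query('//table//tr') as $tr) {
    $cells = $xpath->query('./td', $tr);
    if ($cells->length < 2) {
        continue; // skip spacer or header rows
    }
    $rows[] = [
        'name'  => trim($cells->item(0)->textContent),
        'stock' => (int) trim($cells->item(1)->textContent),
    ];
}

print_r($rows);
```

If the real page writes its cells with single quotes or string concatenation inside document.write, the capture group would need adjusting; the positional XPath would likewise have to be matched to the actual column layout.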