php regex web-scraping html-parsing text-extraction

Extract mileage value from eBay webpages

I'm trying to extract the mileage value from different eBay pages, but I'm stuck: the pages differ slightly from one another, so there seem to be too many patterns to cover. Can you help me with a better pattern? Some example items:

http://cgi.ebay.com/ebaymotors/1971-Chevy-C10-Shortbed-Truck-/250647101696?cmd=ViewItem&pt=US_Cars_Trucks&hash=item3a5bbb4100
http://cgi.ebay.com/ebaymotors/1987-HANDICAP-LEISURE-VAN-W-WHEEL-CHAIR-LIFT-/250647101712?cmd=ViewItem&pt=US_Cars_Trucks&hash=item3a5bbb4110
http://cgi.ebay.com/ebaymotors/ws/eBayISAPI.dll?ViewItemNext&item=250647101696

Please see my current patterns at the following link (I still cannot figure out how to escape the HTML here):

http://pastebin.com/zk4HAY3T

However, they are not enough, as it seems there are still new patterns.

Solution

This should be a bit more generic: it doesn't care what's inside the HTML tags. It works on all three of the links you provided.

/Mileage[^<]*<[^>]*><[^>]*>(.*?)\s*miles/i

Of course, there could be better ways depending on what other constraints you have, but this is a good starting point.

Recognizing the duplication there, you could simplify (logically, at least) a bit more:

/Mileage[^<]*(?:<[^>]*>){2}(.*?)\s*miles/i

You're looking for two HTML tags in a row between the words 'Mileage' and 'miles'. That's the (?:<[^>]*>){2} part. The ?: makes the group non-capturing, so that $matches[1] still contains the number you're looking for, and the {2} indicates that you want to match the preceding group exactly twice.
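
As a rough sketch of how the simplified pattern could be used with preg_match: the function name extractMileage and the sample markup are made up for illustration (the real eBay HTML isn't reproduced here), and the nullable return type assumes PHP 7.1 or later.

```php
<?php
/**
 * Return the mileage found in $html as an integer, or null if the
 * pattern does not match. extractMileage is a hypothetical helper name.
 */
function extractMileage(string $html): ?int
{
    // "Mileage", optional text, exactly two tags, the value, then "miles".
    $pattern = '/Mileage[^<]*(?:<[^>]*>){2}(.*?)\s*miles/i';

    if (preg_match($pattern, $html, $matches) !== 1) {
        return null;
    }

    // $matches[1] is the only capture group; drop commas and other non-digits.
    $digits = preg_replace('/\D/', '', $matches[1]);

    return $digits === '' ? null : (int) $digits;
}

// Assumed sample markup that only imitates the label/two-tags/value layout.
$html = '<tr><th>Mileage:</th><td>123,456 miles</td></tr>';
var_dump(extractMileage($html)); // int(123456)
```

Note that the pattern expects the two tags to be directly adjacent, so markup with whitespace or a line break between them would return null here and would need a looser pattern.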