I'm trying to extract the mileage value from different eBay pages, but I'm stuck: the pages differ slightly, so there seem to be too many patterns to cover. Can you help me find a better pattern?
Some example items:
http://cgi.ebay.com/ebaymotors/1971-Chevy-C10-Shortbed-Truck-/250647101696?cmd=ViewItem&pt=US_Cars_Trucks&hash=item3a5bbb4100
http://cgi.ebay.com/ebaymotors/1987-HANDICAP-LEISURE-VAN-W-WHEEL-CHAIR-LIFT-/250647101712?cmd=ViewItem&pt=US_Cars_Trucks&hash=item3a5bbb4110
http://cgi.ebay.com/ebaymotors/ws/eBayISAPI.dll?ViewItemNext&item=250647101696
Please see the patterns at the following link (I still cannot figure out how to escape the HTML here):
http://pastebin.com/zk4HAY3T
However, they are not enough, as new patterns keep turning up.
This should be a bit more generic - it doesn't care what's inside the html tags. It works on all three of the links you provided.
/Mileage[^<]*<[^>]*><[^>]*>(.*?)\s*miles/i
Of course, there could be better ways depending on what other constraints you have, but this is a good starting point.
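As a rough sketch of how that pattern might be used from PHP: the HTML fragment below is made up for illustration (the real eBay markup may differ), and the variable names are my own. It assumes the label and the value are separated by exactly two adjacent tags, which is what the pattern expects.

```php
<?php
// Hypothetical markup for illustration; real listing pages may be laid out differently.
$html = '<tr><td class="label">Mileage:</td><td>123,456 miles</td></tr>';

// "Mileage", any text up to the next tag, two adjacent tags, then the value before "miles".
$pattern = '/Mileage[^<]*<[^>]*><[^>]*>(.*?)\s*miles/i';

if (preg_match($pattern, $html, $matches)) {
    // $matches[1] holds the captured text, e.g. "123,456".
    // Drop the thousands separators before converting to an integer.
    $mileage = (int) str_replace(',', '', $matches[1]);
} else {
    // No match: the page probably uses a layout this pattern does not cover.
    $mileage = null;
}
```

Checking for null afterwards is one simple way to spot pages that still need a different pattern.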
Recognizing the duplication there, you could simplify (logically, at least) a bit more:
/Mileage[^<]*(?:<[^>]*>){2}(.*?)\s*miles/i
You're looking for two HTML tags in a row between the words 'Mileage' and 'miles'. That's the (?:<[^>]*>){2} part. The ?: tells it not to remember that sequence, so that $matches[1] still contains the number you're looking for, and the {2} indicates that you want to match the previous sequence exactly twice.
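To see what the ?: buys you, here is a small comparison, again on a made-up HTML fragment with variable names of my own choosing. With a plain capturing group around the tags, the number should shift from $matches[1] to $matches[2].

```php
<?php
// Hypothetical markup for illustration only.
$html = '<td>Mileage:</td><td>98,765 miles</td>';

// Non-capturing group: the tags are matched but not stored,
// so the number is the first (and only) captured group.
preg_match('/Mileage[^<]*(?:<[^>]*>){2}(.*?)\s*miles/i', $html, $nonCapturing);

// Capturing group: the repeated tag group takes slot 1 (it keeps the last
// tag it matched), and the number moves to slot 2.
preg_match('/Mileage[^<]*(<[^>]*>){2}(.*?)\s*miles/i', $html, $capturing);

echo $nonCapturing[1], "\n"; // the number
echo $capturing[1], "\n";    // the last tag matched by the repeated group
echo $capturing[2], "\n";    // the number
```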