I'm working on some code using a web scraper in Python.
Given a website's source code, I need to extract relevant data points. The source code looks like this.
</sup>73.00</span> </td> </tr> <tr class="highlight"> <td><span class="data_lbl">Average</span></td> <td> <span class="data_data"><sup>
</sup>86.06</span> </td> </tr> <tr> <td><span class="data_lbl">Current Price</span></td> <td> <span class="data_data"><sup> </sup>83.20</span> </td>
</tr> </tbody> </table> </div> </div> <!--data-module-name="quotes.module.researchratings.Module"--> </div> <div class="column at8-
col4 at16-col4 at12-col6" id="adCol"> <div intent in-at4units-prepend="#adCol" in-at8units-prepend="#adCol" in-at12units-prepend="#adCol
Here is the regex I'm using:
regex = re.compile('Average*</sup>.....')
Which aims to get the 5 characters after the first "/sup" tag encountered after "Average", which in this case would be "86.06" (although I need to clean up the match before I'm left with just a float).
Is there a more elegant way of doing this, one that outputs the first float encountered after the string "Average"?
I'm very new to using regex and apologize if the question isn't clear enough.
I've been able to achieve that using lookbehind assertions combined with ungreedy search:
(?<=Average).*?(?<=<\/sup>)([0-9.]{5})
This working example here
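In Python, the pattern can be applied with `re.search`. The following is only a minimal sketch: the `html` string is a shortened copy of the snippet from the question, and the `re.DOTALL` flag is an added assumption so that `.` also matches a line break, in case one falls between "Average" and `</sup>` in the real page source.

```python
import re

# Shortened sample based on the HTML in the question (assumed to be representative).
html = (
    '<td><span class="data_lbl">Average</span></td> <td> '
    '<span class="data_data"><sup>\n</sup>86.06</span> </td> </tr> '
    '<tr> <td><span class="data_lbl">Current Price</span></td> <td> '
    '<span class="data_data"><sup> </sup>83.20</span> </td>'
)

# Same pattern as above; re.DOTALL is an assumption added for multi-line input.
pattern = re.compile(r"(?<=Average).*?(?<=<\/sup>)([0-9.]{5})", re.DOTALL)

match = pattern.search(html)
# The number sits in the first capture group; convert it to a float if found.
average = float(match.group(1)) if match else None
print(average)
```

With this sample, `match.group(1)` is the string `86.06`, so no further cleanup is needed before calling `float`.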
Explanation
([0-9.]{5})
: look for 5 characters, each a digit from 0 to 9 or a dot, after the following three points.
(?<=Average)
: the word Average must appear before.
.*?
: any number of characters in between. Non-greedy (will match as few characters as possible).
(?<=<\/sup>)
: the tag </sup> must appear before.
The number you're looking for will be in the first capture group.
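Since the question asks for the first float after "Average" rather than exactly 5 characters, here is a possible variation. It is a sketch under assumptions, not part of the pattern explained above: it replaces `[0-9.]{5}` with `\d+(?:\.\d+)?` so the number can have any length, it consumes `</sup>` directly instead of using a lookbehind, and the helper name `first_float_after` is hypothetical.

```python
import re

def first_float_after(label, text):
    """Return the first number that follows `label` and a closing </sup> tag, or None."""
    pattern = re.compile(
        # re.escape guards against regex metacharacters in the label.
        re.escape(label) + r".*?</sup>\s*(\d+(?:\.\d+)?)",
        re.DOTALL,  # let '.' cross line breaks in the page source
    )
    match = pattern.search(text)
    return float(match.group(1)) if match else None

# Shortened sample based on the HTML in the question (assumed to be representative).
html = (
    '<td><span class="data_lbl">Average</span></td> <td> '
    '<span class="data_data"><sup>\n</sup>86.06</span> </td> </tr> '
    '<tr> <td><span class="data_lbl">Current Price</span></td> <td> '
    '<span class="data_data"><sup> </sup>83.20</span> </td>'
)

print(first_float_after("Average", html))
print(first_float_after("Current Price", html))
```

The same helper can then be reused for the other labels in the table, and it returns `None` when the label is not present.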