Hey, I'm trying to scrape a specific value from a website, from markup like this:
<td><a href="javascript:void(0)" class="rankRow"
data-rankkey="25">
Averages
</a>
</td>
<td class="page_speed_602217763">
82.84 </td>
</tr>
I'm trying to get the number 82.84. The number in the page_speed_** class varies, and the only constant that distinguishes this row from the rest of the page source is the text "Averages".
I have tried preg_match_all, but I can't seem to get it to match across more than one line and whatever is in between.
The code I have used is the following:
<form method="post">
<input type="text" name="Player1Link" placeholder="Player 1"> <br>
</form>
<?php
$Player1Link = $_POST["Player1Link"];
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, $Player1Link);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
$curlresult = curl_exec($curl);
$pattern = '!data-rankkey="25">[\s]*Averages[\s]*<\/a>[\s]*<\/td>[\s]*<td[^\s]*?class="page_speed_([\d]*)">[\s]*([\d]*.[\d]*)[\s]*</td>[\s]*<\/tr>!';
preg_match_all($pattern, $curlresult, $matches);
print_r($matches);
$P1AvgHigh = $matches[0][3];
echo "<br>";
echo $P1AvgHigh;
curl_close($curl);
?>
The website I'm using is https://app.dartsorakel.com/player/stats/8 and its source is at view-source:https://app.dartsorakel.com/player/stats/8
Thanks in advance
You can't reliably parse [X]HTML with regex. HTML is not a regular language, so regular expressions are not equipped to break it down into its meaningful parts.
Parsing HTML with Regex is nearly always a bad idea.
Use a proper HTML parser instead, for example DOMDocument with DOMXPath and the XPath query //td[contains(@class, 'page_speed_')]
Sample:
$html=' <td><a href="javascript:void(0)" class="rankRow"
data-rankkey="25">
Averages
</a>
</td>
<td class="page_speed_602217763">
82.84 </td>
</tr>';
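// Parse the HTML; @ suppresses warnings about the malformed fragment.
$domd = new DOMDocument();
@$domd->loadHTML($html);
$xp = new DOMXPath($domd);
// Take the first <td> whose class attribute contains "page_speed_".
$page_speed = $xp->query("//td[contains(@class, 'page_speed_')]")->item(0)->textContent;
// Strip the surrounding whitespace and newlines.
$page_speed = trim($page_speed);
var_dump($page_speed);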
This dumps:
string(5) "82.84"
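The query above returns the first page_speed_ cell it finds. If the real page has several such cells, it may help to anchor on the "Averages" label from the question. The following is a minimal sketch of that idea, not a tested scraper for the site: the function name findAverage and the two-row sample table are made up for illustration, and it assumes the value cell is a following sibling of the cell holding the "Averages" link, as in the snippet from the question.

```php
<?php
// Hypothetical helper: returns the trimmed text of the page_speed_ cell
// in the row labelled "Averages", or null if no such cell is found.
function findAverage(string $html): ?string
{
    $dom = new DOMDocument();
    // Collect libxml parse warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Find the <td> whose link text is "Averages" (ignoring surrounding
    // whitespace), then take the first following sibling <td> whose class
    // contains "page_speed_".
    $query = "//td[a[normalize-space()='Averages']]"
           . "/following-sibling::td[contains(@class, 'page_speed_')][1]";
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

// Made-up sample with two rows, so that taking the first page_speed_ cell
// would give the wrong value.
$html = '<table><tr><td><a href="javascript:void(0)" class="rankRow" data-rankkey="24">
Other
</a></td><td class="page_speed_111">
10.00 </td></tr>
<tr><td><a href="javascript:void(0)" class="rankRow" data-rankkey="25">
Averages
</a></td><td class="page_speed_602217763">
82.84 </td></tr></table>';

echo findAverage($html), PHP_EOL; // prints 82.84
```

To use it with the cURL code from the question, pass $curlresult to findAverage() in place of $html, and check for null in case the page layout differs from the snippet.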