php regex web-scraping curl preg-match-all

How do I scrape multiple lines of the page source using cURL and preg_match_all?


Hey, I'm trying to scrape a specific value from a website, from markup like this:

  <td><a href="javascript:void(0)" class="rankRow"
                                                                           data-rankkey="25">
                                                                                    Averages
                                                                            </a>
                                                                    </td>
                                                                    <td class="page_speed_602217763">
                                                                            82.84                                                                        </td>
                                                            </tr>

I'm trying to get the number 82.84. The page_speed_** number varies, and the one constant that differentiates this row from the rest of the page source is the text "Averages".

I have tried using preg_match_all, but I can't seem to match across more than one line, including whatever is in between.

The code I have used is the following:

<form method="post">
<input type="text" name="Player1Link" placeholder="Player 1"> <br>
</form>

    <?php
    $Player1Link = $_POST["Player1Link"];

    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $Player1Link);
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    $curlresult = curl_exec($curl);
    $pattern = '!data-rankkey="25">[\s]*Averages[\s]*<\/a>[\s]*<\/td>[\s]*<td[^\s]*?class="page_speed_([\d]*)">[\s]*([\d]*.[\d]*)[\s]*</td>[\s]*<\/tr>!';
    preg_match_all($pattern, $curlresult, $matches);
    print_r($matches);

    $P1AvgHigh = $matches[0][3];
    echo "<br>";
    echo $P1AvgHigh;
    curl_close($curl);
    ?>

My results are shown in a screenshot (not reproduced here), and the website I'm using is

https://app.dartsorakel.com/player/stats/8 and the page source is at view-source:https://app.dartsorakel.com/player/stats/8

Thanks in advance


Solution

  • As a wise man once said:

    You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML. Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML. HTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down HTML into its meaningful parts.

    Parsing HTML with Regex is nearly always a bad idea.

    Use a proper HTML parser instead, for example DOMDocument with an XPath query such as: //td[contains(@class, 'page_speed_')]

    sample:

    $html='  <td><a href="javascript:void(0)" class="rankRow"
                                                                               data-rankkey="25">
                                                                                        Averages
                                                                                </a>
                                                                        </td>
                                                                        <td class="page_speed_602217763">
                                                                                82.84                                                                        </td>
                                                                    </tr>';
    $domd = new DOMDocument();
    @$domd->loadHTML($html);
    $xp = new DOMXPath($domd);
    $page_speed = $xp->query("//td[contains(@class, 'page_speed_')]")->item(0)->textContent;
    $page_speed = trim($page_speed);
    var_dump($page_speed);
    

    dumps:

    string(5) "82.84"
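
    The sample above takes the first page_speed_ cell it finds. On the real page there are presumably several such rows, so the query may need to be anchored on the "Averages" label that the question identifies as the only constant. The following is a minimal sketch of that idea, not a tested solution for the live site: the helper name extractAverage is hypothetical, and it assumes the value cell is a sibling td that follows the label cell in the same row, as in the snippet from the question.

```php
<?php
// Hypothetical helper (not part of the original answer): returns the text of
// the page_speed_ cell that follows the "Averages" label, or null if absent.
function extractAverage(string $html): ?string
{
    if ($html === '') {
        return null; // DOMDocument::loadHTML() rejects an empty string
    }

    $dom = new DOMDocument();
    // Real pages are rarely valid HTML; collect parser warnings silently
    // instead of suppressing them with @.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // 1. Find the <td> containing a link whose trimmed text is "Averages".
    // 2. From there, take the first following sibling <td> whose class
    //    starts with "page_speed_" (the numeric suffix varies).
    $query = "//td[a[normalize-space(.) = 'Averages']]"
           . "/following-sibling::td[starts-with(@class, 'page_speed_')][1]";
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    return trim($nodes->item(0)->textContent);
}

// Assumed sample markup: two rows, so that taking item(0) of every
// page_speed_ cell would return the wrong value.
$sample = '<table>
  <tr>
    <td><a href="javascript:void(0)" class="rankRow" data-rankkey="24">
        Highest
    </a></td>
    <td class="page_speed_111">
        100.00    </td>
  </tr>
  <tr>
    <td><a href="javascript:void(0)" class="rankRow" data-rankkey="25">
        Averages
    </a></td>
    <td class="page_speed_602217763">
        82.84    </td>
  </tr>
</table>';

var_dump(extractAverage($sample));
```

    With the question's code, the string returned by curl_exec (there called $curlresult) could be passed to this function in place of $sample, after checking that curl_exec did not return false. Whether the query matches the live page depends on the markup actually served, which may differ from the snippet quoted in the question.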