php, web, screen-scraping

PHP scraping: how do I capture a value from the page source?


In the following HTML, I want to extract the value "1.31". Thanks in advance for your help.

Source Code
    <div style="font-size:20px">1.31 <i class="fa fa-try"></i> <span style="text-decoration: line-through; color:#919191; font-size: 14px; margin-top: 7px; margin-right: 5px; float:left" itemprop="price" content="1.55">1.55 <i class="fa fa-try" itemprop="priceCurrency" content="TL"></i></span>
    <link itemprop="availability" href="http://schema.org/InStock">
    </div>

<?php

$url = "https://www.oyunfor.com/knight-online/gb-gold-bar";

$url_connect = file_get_contents($url);

preg_match('@<div style="font-size:20px">(.*?)<i@si',$url_connect,$results);

print_r($results);

?>

Solution

  • Your code works fine as posted. I would, however, suggest a minor modification:

    <?php
    $markup = <<<HTML
        <div style="font-size:20px">1.31 <i class="fa fa-try"></i> <span style="text-decoration: line-through; color:#919191; font-size: 14px; margin-top: 7px; margin-right: 5px; float:left" itemprop="price" conten
        <link itemprop="availability" href="http://schema.org/InStock">
        </div>
    HTML;
    
    preg_match('@<div style="font-size:20px">(.*?)<i@si', $markup, $results);
    var_dump($results[1]);
    

    The output of that is:

    string(5) "1.31 "
    

    UPDATE:

    As you point out in the comments below, you do not get the expected result when, instead of the static markup used above for demonstration, you fetch the markup from the remote server with an HTTP request, as in your question.

    The reason is that the markup you receive that way does not match the example in your question. It differs slightly, and that is enough for your regular expression to stop matching. This is the main reason regular expressions are considered a poor way to parse such markup: they break too easily as soon as the subject markup changes even slightly.
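
    To illustrate how brittle the exact-match pattern is, here is a minimal sketch. The second sample is an assumed variant made up for illustration (attribute quotes missing); it is not the actual markup returned by the site.

    ```php
    <?php
    // Pattern from the question: expects exactly <div style="font-size:20px">
    $strict = '@<div style="font-size:20px">(.*?)<i@si';

    // Markup as posted in the question (shortened)
    $asPosted = '<div style="font-size:20px">1.31 <i class="fa fa-try"></i></div>';

    // Assumed variant, for illustration only: same element, quotes missing
    $asVariant = '<div style=font-size:20px>1.31 <i class="fa fa-try"></i></div>';

    $matchesPosted  = preg_match($strict, $asPosted);   // 1: pattern matches
    $matchesVariant = preg_match($strict, $asVariant);  // 0: one small change, no match

    var_dump($matchesPosted, $matchesVariant);
    ```

    A browser would render both samples identically, which is why the difference is easy to miss.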

    To be more specific: the markup you receive is actually invalid. You probably did not notice because you viewed it in a browser, and browsers try to "fix" broken markup to make it usable. For debugging, you need to look at the data without such intermediate layers to learn what you are actually dealing with. Here, you should have dumped the received markup to a log file.
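
    Dumping the raw response could look roughly like this. The helper name `dumpMarkup` and the file name are hypothetical, and the commented-out usage assumes the URL from the question is reachable.

    ```php
    <?php
    // Hypothetical helper: write the markup to a file exactly as received,
    // so it can be inspected without a browser repairing it first.
    function dumpMarkup(string $markup, string $path): void
    {
        if (file_put_contents($path, $markup) === false) {
            throw new RuntimeException("Could not write markup to $path");
        }
    }

    // Usage with the URL from the question (needs network access):
    // $markup = file_get_contents('https://www.oyunfor.com/knight-online/gb-gold-bar');
    // if ($markup === false) {
    //     exit('Request failed');
    // }
    // dumpMarkup($markup, sys_get_temp_dir() . '/oyunfor-raw.html');
    ```

    Opening the dumped file in a plain text editor, not a browser, shows the markup the regular expression actually has to match.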

    Anyway: you can modify your regular expression slightly so that it matches again. This is the pattern I would suggest; with it you get the same output as shown above:

    @<div\s+[^>]*style="?font-size:20px"?[^>]*>(.*?)<i@si
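
    As a quick check, the relaxed pattern can be run against both a quoted and an unquoted form of the opening tag. The unquoted sample with an extra `class` attribute is an assumption for illustration, not the markup the site actually returns.

    ```php
    <?php
    // Relaxed pattern: tolerates extra attributes and optional quotes
    $pattern = '@<div\s+[^>]*style="?font-size:20px"?[^>]*>(.*?)<i@si';

    $samples = [
        // Markup as posted in the question (shortened)
        'quoted'   => '<div style="font-size:20px">1.31 <i class="fa fa-try"></i></div>',
        // Assumed variant: extra attribute, no quotes around the style value
        'unquoted' => '<div class="price" style=font-size:20px>1.31 <i class="fa fa-try"></i></div>',
    ];

    $prices = [];
    foreach ($samples as $label => $markup) {
        // trim() drops the trailing space captured before the <i> tag
        $prices[$label] = preg_match($pattern, $markup, $results) === 1
            ? trim($results[1])
            : null;
    }

    var_dump($prices);
    ```

    Both entries come out as "1.31". The pattern is still tied to the exact `font-size:20px` value, so it will stop matching again if the site changes that style.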