Tags: php, regex, curl, web-scraping, preg-match

How to download the source code of a URL and find specific text


I want to download the source code of a URL, find specific text in it, and store that text in variables.

Suppose I have the URL http://www.homedepot.com/p/Ryobi-185-MPH-510-CFM-Gas-Backpack-Blower-RY08420A/203312654

I want to download its source and extract the text below, which appears near the bottom of the page source. I also want to store each variable, such as CI_Pagetype and CI_ItemID, in a PHP variable so that I can write them to a CSV file.

<script>
    var CI_Pagetype = 'PRODUCT';
    var CI_ItemID = '203312654';
    var CI_ItemName = '185 MPH 510 CFM Gas Backpack Blower';
    var CI_CatID = '556375';
    var CI_CatName = '';
    var CI_ItemPrice = $('#ciItemPrice').val();
    var CI_ItemMfr = 'Ryobi';
    var CI_ItemMfrNum = '573539';
    var CI_ItemUPC = '046396001122';
    var CI_ItemAvailability = $('#ciItemAvailability').val();
    var CI_ItemISBN = '';
    var CI_ItemShipWeight = '22';

Currently I can download the source code using file_get_contents(), but I am not sure how to write a regular expression to extract that data.

Please help me find a solution.


Solution

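  • You can test the pattern on this site: https://regex101.com/

    With this regex: var (CI_)([A-Za-z0-9]*) = '([a-zA-Z0-9 ]*)';

    Enable the g (global) flag there so that every assignment is matched. PHP has no g flag; preg_match_all() does the same job.

    For this sample: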

    <script>
    var CI_Pagetype = 'PRODUCT';
    var CI_ItemID = '203312654';
    var CI_ItemName = '185 MPH 510 CFM Gas Backpack Blower';
    var CI_CatID = '556375';
    var CI_CatName = '';
    var CI_ItemPrice = $('#ciItemPrice').val();
    var CI_ItemMfr = 'Ryobi';
    var CI_ItemMfrNum = '573539';
    var CI_ItemUPC = '046396001122';
    var CI_ItemAvailability = $('#ciItemAvailability').val();
    var CI_ItemISBN = '';
    var CI_ItemShipWeight = '22';
    
    var bcData = new Object();
    

    Result:

    MATCH 1
    1.  [19-22] `CI_`
    2.  [22-30] `Pagetype`
    3.  [34-41] `PRODUCT`
    MATCH 2
    1.  [52-55] `CI_`
    2.  [55-61] `ItemID`
    3.  [65-74] `203312654`
    MATCH 3
    1.  [85-88] `CI_`
    2.  [88-96] `ItemName`
    3.  [100-135]   `185 MPH 510 CFM Gas Backpack Blower`
    MATCH 4
    1.  [146-149]   `CI_`
    2.  [149-154]   `CatID`
    3.  [158-164]   `556375`
    MATCH 5
    1.  [175-178]   `CI_`
    2.  [178-185]   `CatName`
    3.  [189-189]   ``
    MATCH 6
    1.  [248-251]   `CI_`
    2.  [251-258]   `ItemMfr`
    3.  [262-267]   `Ryobi`
    MATCH 7
    1.  [278-281]   `CI_`
    2.  [281-291]   `ItemMfrNum`
    3.  [295-301]   `573539`
    MATCH 8
    1.  [312-315]   `CI_`
    2.  [315-322]   `ItemUPC`
    3.  [326-338]   `046396001122`
    MATCH 9
    1.  [411-414]   `CI_`
    2.  [414-422]   `ItemISBN`
    3.  [426-426]   ``
    MATCH 10
    1.  [437-440]   `CI_`
    2.  [440-454]   `ItemShipWeight`
    3.  [458-460]   `22`
    

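    CI_ItemPrice and CI_ItemAvailability are assigned from jQuery calls rather than quoted string literals, so the regex does not match them and no value is captured for either.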

    // Match assignments of the form: var CI_Name = 'value';
    $re = "/var (CI_)([A-Za-z0-9]*) = '([a-zA-Z0-9 ]*)';/";
    $str = "<script>\nvar CI_Pagetype = 'PRODUCT';\nvar CI_ItemID = '203312654';\nvar CI_ItemName = '185 MPH 510 CFM Gas Backpack Blower';\nvar CI_CatID = '556375';\nvar CI_CatName = '';\nvar CI_ItemPrice = \$('#ciItemPrice').val();\nvar CI_ItemMfr = 'Ryobi';\nvar CI_ItemMfrNum = '573539';\nvar CI_ItemUPC = '046396001122';\nvar CI_ItemAvailability = \$('#ciItemAvailability').val();\nvar CI_ItemISBN = '';\nvar CI_ItemShipWeight = '22';\n\nvar bcData = new Object();";

    // $matches[2] holds the names without the CI_ prefix, $matches[3] the values.
    preg_match_all($re, $str, $matches);
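
    The question also asks for the values to end up in PHP variables and in a CSV file. Here is a minimal sketch of one way to do that, not the only one. The function names extractCiVars and writeCsvRow and the output file name are hypothetical. The pattern is a loosened variant of the one above: it captures the whole variable name in one group and accepts any value that contains no single quote, which is an assumption about the page. The sample is inlined so the sketch runs without network access; for a live page the HTML could come from file_get_contents() as in the question.

```php
<?php
// Sketch: collect CI_* string literals from page source into an array,
// then write them to a CSV file. Function names are hypothetical.

/**
 * Return ['CI_Pagetype' => 'PRODUCT', ...] for every
 * `var CI_xxx = '...';` assignment found in $html.
 */
function extractCiVars(string $html): array
{
    // [^']* accepts any value without a single quote (an assumption;
    // the pattern in the answer only allows letters, digits and spaces).
    $re = "/var\s+(CI_[A-Za-z0-9]+)\s*=\s*'([^']*)';/";
    $vars = [];
    // PREG_SET_ORDER yields one array per match: [full, name, value].
    if (preg_match_all($re, $html, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            $vars[$m[1]] = $m[2];
        }
    }
    return $vars;
}

/** Write a header line and one data line to $path. */
function writeCsvRow(string $path, array $vars): void
{
    $fh = fopen($path, 'w');
    if ($fh === false) {
        throw new RuntimeException("Cannot open $path for writing");
    }
    // Delimiter, enclosure and escape are passed explicitly so the
    // call behaves the same across PHP versions.
    fputcsv($fh, array_keys($vars), ',', '"', '\\');
    fputcsv($fh, array_values($vars), ',', '"', '\\');
    fclose($fh);
}

// Shortened copy of the sample from the question (nowdoc, so the
// `$` in the jQuery calls is not interpolated).
$html = <<<'HTML'
<script>
    var CI_Pagetype = 'PRODUCT';
    var CI_ItemID = '203312654';
    var CI_ItemPrice = $('#ciItemPrice').val();
    var CI_ItemMfr = 'Ryobi';
    var CI_ItemISBN = '';
HTML;

$vars = extractCiVars($html);
writeCsvRow(sys_get_temp_dir() . '/ci_vars.csv', $vars);
```

    After this runs, $vars['CI_ItemID'] holds the item ID as a string, and CI_ItemPrice is absent from the array because its value is a jQuery call rather than a quoted literal. Keying the array by variable name keeps the CSV header and the data row in the same order.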