php curl web-scraping simple-html-dom

Trouble getting the name of a product from a webpage


I've written a script in PHP to scrape the title of a product shown in the top-right corner of a webpage. The title is displayed as Gucci.

When I execute the script below, I get the following notice: Notice: Trying to get property 'plaintext' of non-object in C:\xampp\htdocs\runcode\testfile.php on line 16.

How can I get only the name Gucci from that webpage?

Link to the url

Here's what I've written so far:

<?php
include "simple_html_dom.php";
$link = "https://www.farfetch.com//bd/shopping/men/gucci-rhyton-web-print-leather-sneaker-item-12964878.aspx"; 

function get_content($url)
    {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_HTTPHEADER, array('User-Agent: Mozilla/5.0',));
        curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        $htmlContent = curl_exec($ch);
        curl_close($ch);
        $dom = new simple_html_dom();
        $dom->load($htmlContent);
        $itemTitle = $dom->find('#bannerComponents-Container [itemprop="name"]', 0)->plaintext;
        echo "{$itemTitle}";
    }
get_content($link);
?>

Btw, the selector I've used within the script is flawless.

To clear up the confusion, I've copied a chunk of HTML from the page source. It is neither generated dynamically nor obfuscated by JavaScript, so I don't see any reason why cURL shouldn't be able to handle it:

<div class="cdb2b6" id="bannerComponents-Container">
    <p class="_41db0e _527bd9 eda00d" data-tstid="merchandiseTag">New Season</p>
    <div class="_1c3e57">
        <h1 class="_61cb2e" itemProp="brand" itemscope="" itemType="http://schema.org/Brand">
            <a href="/bd/shopping/men/gucci/items.aspx" class="fd9e8e e484bf _4a941d f140b0" data-trk="pp_infobrd" data-tstid="cardInfo-title" itemProp="url" aria-label="Gucci">
                <span itemProp="name">Gucci</span>
            </a>
        </h1>
    </div>
</div>

Postscript: It's a pity that I have to show a working example in another language to prove that the name Gucci is not dynamically generated, as a few comments and an answer have already suggested.

The following script is written in Python (using the requests module, which can't handle dynamic content):

import requests
from bs4 import BeautifulSoup

url = "https://www.farfetch.com//bd/shopping/men/gucci-rhyton-web-print-leather-sneaker-item-12964878.aspx"

with requests.Session() as s:
    s.headers["User-Agent"] = "Mozilla/5.0"
    res = s.get(url)
    soup = BeautifulSoup(res.text,"lxml")
    item = soup.select_one('#bannerComponents-Container [itemprop="name"]').text
    print(item)

Output it produces:

Gucci

Now it's clear that the content I'm looking for is static.

Please see the image below, in which I've marked the title with a pencil.

enter image description here


Solution

  • The main difference between your successful Python script and your PHP script is the use of a session. Your PHP script doesn't use cookies, and that triggers a different response from the server.

    We have two options:

    1. Change the selector. As mentioned in Mark's answer, the item is still in the HTML, but in a different tag. We could get it with this selector:

      'a[itemprop="brand"]'
      
    2. Use cookies. We can get the same response as your Python script if we use CURLOPT_COOKIESESSION and a temporary file to write/read the cookies.

      function get_content($url) {
          $cookieFileh = tmpfile(); // temporary file that will hold the cookies
          $cookieFile = stream_get_meta_data($cookieFileh)['uri']; // path of that file
          $ch = curl_init();
          curl_setopt($ch, CURLOPT_URL, $url);
          curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0');
          curl_setopt($ch, CURLOPT_COOKIESESSION, true);
          curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);
          curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile); 
          curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false); // insecure; see the note on this option below
          curl_setopt($ch, CURLOPT_ENCODING, "gzip");
          curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
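          curl_exec($ch); // first request: its only purpose is to receive the cookies
          $htmlContent = curl_exec($ch); // second request: sends the cookies back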
          curl_close($ch);
          fclose($cookieFileh); // thanks to tmpfile(), this also deletes the cookie file.
          $dom = new simple_html_dom();
          $dom->load($htmlContent);
          $itemTitle = $dom->find('#bannerComponents-Container [itemprop="name"]', 0)->plaintext;
          echo "{$itemTitle}";
      }
      
      $link = "https://www.farfetch.com/bd/shopping/men/gucci-rhyton-web-print-leather-sneaker-item-12964878.aspx"; 
      get_content($link);
      // Output: Gucci
      

      This script performs two requests on the same handle: the first receives the cookies from the server, and the second sends them back.

      In this case the server returns a compressed response, so I've used CURLOPT_ENCODING to unzip the contents.

      Since you use headers only to set a user-agent, it's best to use the CURLOPT_USERAGENT option.

      I've set CURLOPT_SSL_VERIFYPEER to false because I haven't configured a certificate, so cURL fails on HTTPS requests. If you can already communicate with HTTPS sites, it's best not to use this option, for security reasons. If not, you could set a certificate with CURLOPT_CAINFO.
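
The notice in the question appears because the lookup returns nothing and the script still reads a property from the result. Whichever option is chosen, it may help to check the result first and to try the alternative selector as a fallback. The following is only a rough sketch of that idea under a few assumptions: it swaps in PHP's built-in DOMDocument and DOMXPath for simple_html_dom so that it runs without extra files, the function name find_first_text is made up for illustration, and the second HTML sample is hypothetical markup standing in for the cookie-less response (the real response isn't reproduced here).

```php
<?php
// Sketch: PHP's built-in DOM extension stands in for simple_html_dom here.

/**
 * Return the trimmed text of the first node matched by any of the given
 * XPath queries (tried in order), or null if nothing matches.
 * The function name is hypothetical, chosen for this example only.
 */
function find_first_text(string $html, array $queries): ?string
{
    if (trim($html) === '') {
        return null; // e.g. the request failed and returned no body
    }

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings on real-world markup
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    foreach ($queries as $query) {
        $nodes = $xpath->query($query);
        if ($nodes !== false && $nodes->length > 0) {
            return trim($nodes->item(0)->textContent);
        }
    }
    return null; // nothing matched: the caller decides what to do
}

// XPath equivalents of the two CSS selectors discussed above.
$queries = [
    '//*[@id="bannerComponents-Container"]//*[@itemprop="name"]',
    '//a[@itemprop="brand"]',
];

// Condensed from the markup quoted in the question.
$withContainer = '<div id="bannerComponents-Container"><h1 itemprop="brand">'
    . '<a href="/bd/shopping/men/gucci/items.aspx" itemprop="url">'
    . '<span itemprop="name">Gucci</span></a></h1></div>';

// Hypothetical markup for the response received without cookies.
$withoutContainer = '<div><a itemprop="brand" href="/bd/shopping/men/gucci/items.aspx"> Gucci </a></div>';

echo find_first_text($withContainer, $queries) ?? '(not found)', PHP_EOL;
echo find_first_text($withoutContainer, $queries) ?? '(not found)', PHP_EOL;
echo find_first_text('<p>No product here</p>', $queries) ?? '(not found)', PHP_EOL;
```

With simple_html_dom the same guard would amount to storing the result of find() in a variable and reading plaintext only when that variable is not null.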