Tags: php, web-scraping, meta-tags, domxpath

How to extract meta tags in PHP if server denies access?


There have been many discussions about this in the past, but things have changed a lot. For example, take this question:

Get title of website via link

It has many solutions that worked in the past but no longer work when I test them against some sites, such as

https://webdesign.tutsplus.com/articles/the-complete-beginners-guide-to-chinese-fonts--cms-23444

I tried all the methods mentioned in the above SO discussion and none of them worked for this URL. But when I entered the same URL on the page below, it did return the page title.

http://tools.buzzstream.com/meta-tag-extractor

How did they do it? If they are not using PHP, how can the same thing be done in PHP? Please suggest an answer other than those in the above SO discussion; I tried all of them and none work for the tutsplus website. DOMXPath, file_get_contents(), cURL, and adding a browser header didn't work.


Solution

  • For me it works (-;

    In this situation it was necessary to set a User-Agent, because if you send the request without one, the response is: HTTP request failed! HTTP/1.1 403 Forbidden.

    P.S. Always check the errors and responses (-;

    <?php
    function get_title($url){
        $c = curl_init();
        curl_setopt($c, CURLOPT_URL, $url);
        curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
        // Without a browser-like User-Agent this server answers 403 Forbidden
        curl_setopt($c, CURLOPT_USERAGENT, 'Mozilla/5.0 (X11; Linux x86_64; rv:29.0) Gecko/20100101 Firefox/29.0');

        $str = curl_exec($c); // false on failure
        if(is_string($str) && strlen($str)>0){
            $str = trim(preg_replace('/\s+/', ' ', $str)); // supports line breaks inside <title>
            // non-greedy, ignore case, allow attributes on <title>
            if(preg_match('/<title[^>]*>(.*?)<\/title>/i', $str, $title)){
                return trim($title[1]);
            }
        }
        return null; // request failed or no <title> found
    }
    //Example:
    echo get_title("https://webdesign.tutsplus.com/articles/the-complete-beginners-guide-to-chinese-fonts--cms-23444");
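
The question asks for meta tags, not only the title, and the P.S. above says to check errors and responses. Here is a minimal sketch of how both could look, assuming the cURL and DOM extensions are available. The helper names fetch_html and extract_meta_tags, the 15-second timeout, and the choice to follow redirects are assumptions for illustration and are not part of the original answer.

```php
<?php
// Hypothetical helper: fetch a page with a browser-like User-Agent and
// report failures instead of silently returning nothing.
function fetch_html(string $url): ?string
{
    $c = curl_init($url);
    curl_setopt_array($c, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true, // assumption: redirects should be followed
        CURLOPT_TIMEOUT        => 15,   // assumption: arbitrary timeout in seconds
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (X11; Linux x86_64; rv:29.0) Gecko/20100101 Firefox/29.0',
    ]);
    $body   = curl_exec($c);
    $status = curl_getinfo($c, CURLINFO_HTTP_CODE);
    if ($body === false || $status >= 400) {
        // e.g. status 403 when the server rejects the request
        error_log("Request failed: HTTP $status " . curl_error($c));
        return null;
    }
    return $body;
}

// Hypothetical helper: collect the <title> and all <meta> name/property
// => content pairs from an HTML string using DOMXPath.
function extract_meta_tags(string $html): array
{
    if ($html === '') {
        return []; // DOMDocument::loadHTML() rejects an empty string
    }
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // real-world HTML is rarely valid
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $tags  = [];

    $titleNode = $xpath->query('//title')->item(0);
    if ($titleNode !== null) {
        $tags['title'] = trim($titleNode->textContent);
    }
    foreach ($xpath->query('//meta[@content]') as $meta) {
        // Standard tags use name="...", Open Graph tags use property="..."
        $key = $meta->getAttribute('name') ?: $meta->getAttribute('property');
        if ($key !== '') {
            $tags[strtolower($key)] = $meta->getAttribute('content');
        }
    }
    return $tags;
}
```

Usage would be along the lines of $html = fetch_html($url); followed by print_r(extract_meta_tags($html ?? ''));. Parsing with DOMXPath instead of a regular expression tolerates attribute order and line breaks inside tags, though it still only sees markup present in the raw response, not anything added later by JavaScript.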