php web-scraping keyword meta-tags simple-html-dom

Unable to fetch meta tags for a particular URL


I am using a PHP script to fetch the keywords from the meta tags of a particular website, but for some URLs it does not work. When I check those URLs manually, I can see that the keywords do exist in the page.

$url = "https://www.washingtonpost.com/news/education/wp/2018/02/14/school-shooting-reported-at-florida-high-school/?tid=pm_pop";
get_meta_tags($url);

It always gives me this warning:

Warning: get_meta_tags(https://www.washingtonpost.com/politics/stormy-danielss-tale-gains-renewed-momentum-with-trump-lawyers-claim-which-raises-new-questions/2018/02/14/e7ce4a16-119d-11e8-9065-e55346f6de81_story.html?tid=pm_pop): failed to open stream: Redirection limit reached

Any idea?


Solution

  • Let's go:

    • First: there is an infinite redirect loop, and the server serves the page only if cookies are enabled. So we use cURL to fetch the HTML page in two steps:

      1. Get the cookies
      2. Resend the cookies and get the page

    • Second: parse the HTML with preg_match_all to get the meta tags.

    • Finally, the code is:

      <?php
      $html = get_contents('https://www.washingtonpost.com/news/education/wp/2018/02/14/school-shooting-reported-at-florida-high-school/?tid=pm_pop');
      // parsing begins here:
      preg_match_all('/<[\s]*meta[\s]*(name|property)="?' . '([^>"]*)"?[\s]*' . 'content="?([^>"]*)"?[\s]*[\/]?[\s]*>/si', $html, $match);
      $count = count($match[2]);
      for ($i = 0; $i < $count; $i++) {
          // $match[2] holds the tag names, $match[3] the content values
          echo($match[2][$i] . " : " . $match[3][$i] . "\n");
      }

      function get_contents($link) {
          $result = "";
          $httpcode = 0;
          $curlerr = "";
          try {
              $ch = curl_init();
              curl_setopt($ch, CURLOPT_URL, $link);
              curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 0);
              curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
              curl_setopt($ch, CURLOPT_HEADER, 0);
              // note: these two lines turn off TLS certificate checks
              curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
              curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
              curl_setopt($ch, CURLOPT_COOKIEJAR, "-"); // <-- see here: turns on cookie handling
              $result = curl_exec($ch); // step 1: get the cookies; the handle is not closed yet!
              // Step 2: make another request with the same handle, now following redirects:
              curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
              $result = curl_exec($ch);
              $httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
              $curlerr = curl_error($ch);
              // once you are done, you can close the handle.
              curl_close($ch);
          } catch (Exception $e) {
              $result = "Error1 :" . $result . "||" . $e;
          }
          if (strlen($result) < 5) {
              $result = $result . "Error :" . $httpcode . $curlerr;
          }
          return $result;
      }
      ?>
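
The parsing step can be tried offline, without touching the network. The following is only a sketch: extract_meta_tags() is a hypothetical helper (it is not part of the answer's code) and the sample markup is made up. It uses the same regex idea, so it shares the same limits: the name/property attribute has to come before content, and only double-quoted or unquoted values are matched.

```php
<?php
// Hypothetical helper (not from the original answer): collects meta tags
// from an HTML string into name => [values] using the same regex idea.
function extract_meta_tags(string $html): array
{
    $pattern = '/<\s*meta\s+(?:name|property)="?([^>"]*)"?\s*'
             . 'content="?([^>"]*)"?\s*\/?\s*>/si';
    $tags = [];
    if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            // A name can repeat (e.g. fb:admins), so keep every value.
            $tags[$m[1]][] = $m[2];
        }
    }
    return $tags;
}

// Made-up sample markup, for illustration only.
$sample = '<html><head>'
        . '<meta name="keywords" content="alpha, beta"/>'
        . '<meta property="og:type" content="article">'
        . '<meta property="fb:admins" content="1"/>'
        . '<meta property="fb:admins" content="2"/>'
        . '</head></html>';

$tags = extract_meta_tags($sample);
echo $tags['keywords'][0], "\n";             // alpha, beta
echo implode(',', $tags['fb:admins']), "\n"; // 1,2
```

Grouping values per name keeps repeated tags such as fb:admins or twitter:creator together instead of overwriting earlier values.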

    Note: DOMDocument could not parse this page's HTML.
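
The note does not say why DOMDocument failed. One common cause is libxml emitting warnings on HTML5 elements. If that was the problem here (an assumption, not verified against this page), collecting the warnings with libxml_use_internal_errors() may be enough. A rough sketch, where meta_tags_via_dom() is a hypothetical helper and the sample markup is made up:

```php
<?php
// Hypothetical helper: reads meta tags with DOMDocument while collecting
// libxml parse warnings internally instead of emitting them.
function meta_tags_via_dom(string $html): array
{
    $previous = libxml_use_internal_errors(true); // keep markup warnings quiet
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);        // restore the old setting

    $tags = [];
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        // Accept either a name= or a property= attribute as the key.
        $key = $meta->getAttribute('name') ?: $meta->getAttribute('property');
        if ($key !== '') {
            $tags[$key][] = $meta->getAttribute('content');
        }
    }
    return $tags;
}

// Made-up sample: attributes in either order, mixed quoting, an HTML5 tag.
$sample = '<!DOCTYPE html><html><head>'
        . '<meta content="alpha, beta" name=keywords>'
        . "<meta property='og:type' content='article'>"
        . '</head><body><article>text</article></body></html>';

$tags = meta_tags_via_dom($sample);
echo $tags['keywords'][0], "\n"; // alpha, beta
echo $tags['og:type'][0], "\n";  // article
```

Unlike the regex, this does not depend on attribute order or quoting style, but it needs the DOM extension and the page still has to be fetched with cookies enabled first.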

    Output:

    object-hash : 1518960831
    referrer : unsafe-url
    keywords : Florida school shooting, Marjory Stoneman Douglas High School, Parkland school shooting, Florida shooting, Broward County
    news_keywords : Florida school shooting, Marjory Stoneman Douglas High School, Parkland school shooting, Florida shooting, Broward County
    twitter:card : summary_large_image
    og:type : article
    og:site_name : Washington Post
    magnet : floridashooting
    article:publisher : https://www.facebook.com/washingtonpost
    fb:app_id : 41245586762
    fb:admins : 4403963
    fb:admins : 500835072
    article:content_tier : metered
    og:url : https://www.washingtonpost.com/news/education/wp/2018/02/14/school-shooting-reported-at-florida-high-school/
    og:title : ‘A horrific, horrific day’: At least 17 killed in Florida school shooting
    og:description : The suspect, a student who had been expelled, was armed with an AR-15, authorities said.
    robots : index,follow
    theme : normal
    audio_url : 
    twitter:creator : @lori_rozsa
    article:author : https://www.facebook.com/moriah.balingit
    author : https://www.facebook.com/moriah.balingit
    twitter:creator : @ByMoriah
    twitter:creator : @thewanreport
    article:author : https://www.facebook.com/markberman
    author : https://www.facebook.com/markberman
    twitter:creator : @markberman