
PHP - How to get the main HTML content, like Reader Mode in Firefox


In the Firefox app for Android and in Safari on iPad, "Reader Mode" shows only the main content of a page. How can I recognize only the main content of an HTML page with PHP?

I need to detect the main news article with PHP, the way Firefox or Safari does.

For example, I fetch a news page from bbcsite.com/news/123 with this code:

<?php
    $html = file_get_contents('http://bbcsite.com/news/123');
?>

Then I want to show only the main news article, without ads and other clutter, like Firefox and Safari do.

I found fivefilters.org. That site can extract the content!

Thank you.


Solution

  • Hooray!!!

    I found this source code:

    1) Create Readability.php

    2) Create JSLikeHTMLElement.php

    3) Create index.php with this code:

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html>
        <head>
            <title>!</title>
            <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
        </head>
    <body dir="rtl">
    <?php
    include_once 'Readability.php';
    
    
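    // URL of the article to extract
    // (change this URL to whatever you'd like to test)
    $url = 'http://';
    $html = file_get_contents($url);
    if ($html === false) {
        // file_get_contents() returns false when the page cannot be fetched
        die('Could not fetch ' . htmlspecialchars($url));
    }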
    
    // Note: PHP Readability expects UTF-8 encoded content.
    // If your content is not UTF-8 encoded, convert it 
    // first before passing it to PHP Readability. 
    // Both iconv() and mb_convert_encoding() can do this.
    
    // If we've got Tidy, let's clean up input.
    // This step is highly recommended - PHP's default HTML parser
    // often doesn't do a great job and results in strange output.
    if (function_exists('tidy_parse_string')) {
        $tidy = tidy_parse_string($html, array(), 'UTF8');
        $tidy->cleanRepair();
        $html = $tidy->value;
    }
    
    // give it to Readability
    $readability = new Readability($html, $url);
    // print debug output? 
    // useful to compare against Arc90's original JS version - 
    // simply click the bookmarklet with FireBug's console window open
    $readability->debug = false;
    // convert links to footnotes?
    $readability->convertLinksToFootnotes = true;
    // process it
    $result = $readability->init();
    // does it look like we found what we wanted?
    if ($result) {
        echo "== Title =====================================\n";
        echo $readability->getTitle()->textContent, "\n\n";
        echo "== Body ======================================\n";
        $content = $readability->getContent()->innerHTML;
        // if we've got Tidy, let's clean it up for output
        if (function_exists('tidy_parse_string')) {
            $tidy = tidy_parse_string($content, array('indent'=>true, 'show-body-only' => true), 'UTF8');
            $tidy->cleanRepair();
            $content = $tidy->value;
        }
        echo $content;
    } else {
        echo 'Looks like we couldn\'t find the content. :(';
    }
    ?>
    </body>
    </html>
    

    In the line $url = 'http://'; put the URL of the page you want to extract.

    Thank you;)
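
The comments in the script say that PHP Readability expects UTF-8 input and that iconv() or mb_convert_encoding() can do the conversion, but the script itself never converts anything. Here is a minimal sketch of that step, assuming the mbstring extension is enabled. The helper name toUtf8 and its ISO-8859-1 fallback are assumptions made for illustration; they are not part of PHP Readability. Pass the charset the page actually declares (in its Content-Type header or meta tag) when you know it.

```php
<?php
// Hypothetical helper (not part of PHP Readability): return $html as UTF-8.
// $sourceCharset is the charset you believe the page uses; the default
// 'ISO-8859-1' is only a fallback guess and may be wrong for your site.
function toUtf8(string $html, string $sourceCharset = 'ISO-8859-1'): string
{
    // If the bytes are already valid UTF-8, leave them untouched.
    if (mb_check_encoding($html, 'UTF-8')) {
        return $html;
    }
    // Otherwise re-encode from the assumed source charset.
    return mb_convert_encoding($html, 'UTF-8', $sourceCharset);
}

// Usage: "caf\xE9" is the word "café" encoded as ISO-8859-1.
$utf8 = toUtf8("caf\xE9", 'ISO-8859-1');
```

In index.php this would be called right after file_get_contents(), for example $html = toUtf8($html);, before the Tidy step and before the string is handed to Readability.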