Text to html ratio of a page issue


I am trying to compute the text-to-HTML ratio of a given web page. I use a strip_html_tags function to strip out the HTML tags, then compare the result with the original content of the page to get the ratio. My concern is that strip_html_tags may not catch every tag on the page. Is there a better way to do this, perhaps something that simply removes everything that starts with < and ends with >? I can already see that my regexes miss many tags that should be stripped, so there has to be a better approach.

function strip_html_tags($text)
{
    $text = preg_replace(array(
        '@<head[^>]*?>.*?</head>@siu',
        '@<style[^>]*?>.*?</style>@siu',
        '@<script[^>]*?.*?</script>@siu',
        '@<object[^>]*?.*?</object>@siu',
        '@<embed[^>]*?.*?</embed>@siu',
        '@<applet[^>]*?.*?</applet>@siu',
        '@<noframes[^>]*?.*?</noframes>@siu',
        '@<noscript[^>]*?.*?</noscript>@siu',
        '@<noembed[^>]*?.*?</noembed>@siu',
        '@</?((address)|(blockquote)|(center)|(del))@iu',
        '@</?((div)|(h[1-9])|(ins)|(isindex)|(p)|(pre))@iu',
        '@</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))@iu',
        '@</?((table)|(th)|(td)|(caption))@iu',
        '@</?((form)|(button)|(fieldset)|(legend)|(input))@iu',
        '@</?((label)|(select)|(optgroup)|(option)|(textarea))@iu',
        '@</?((frameset)|(frame)|(iframe))@iu',
        '#<[\/\!]*?[^<>]*?>#siu', // Strip out HTML tags
        '#<![\s\S]*?--[ \t\n\r]*>#siu' // Strip multi-line comments including CDATA
    ), array(
        ' ',
        ' ',
        ' ',
        ' ',
        ' ',
        ' ',
        ' ',
        ' ',
        ' ',
        "\n\$0",
        "\n\$0",
        "\n\$0",
        "\n\$0",
        "\n\$0",
        "\n\$0",
        "\n\$0",
        "\n\$0"
    ), $text);
    return strip_tags($text);
}

function check_ratio($url)
{
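    // Fetch the raw page with cURL (minimal options; extend as needed).
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $file_content = curl_exec($ch);
    if ($file_content === false || $file_content === '') {
        return 0.0; // nothing fetched, so there is no ratio to compute
    }
    $content      = strip_html_tags($file_content);
    // Collapse blank-line runs so they do not inflate the text size.
    $content      = preg_replace("/(^[\r\n]*|[\r\n]+)[\s\t]*[\r\n]+/", " ", $content);
    $len_real     = strlen($file_content); // bytes in the raw page
    $len_strip    = strlen($content);      // bytes left after stripping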
    return round((($len_strip / $len_real) * 100), 2);
}

Solution

  • This approach strips the markup with a single regex.

    Update 1:

    - Added an atomic group around the tag body (the attribute portion) of the
      invisible-content tags; without it, unbalanced quotes could cause
      catastrophic backtracking.

    - Added the list of invisible-content elements that it removes:

    script, style, head, object, embed, applet, noframes, noscript, noembed

    If such an element has no closing tag, only the tag itself is removed; otherwise its content is removed along with the tags.
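
The behaviour described in Update 1 can be checked with a short script. This is a minimal sketch, assuming the pattern is used exactly as given in the answer, with no modifiers. The constant `TAG_REGEX`, the helper `strip_markup()`, and the sample strings are made up for illustration and are not part of the original answer.

```php
<?php
// Pattern copied verbatim from the answer (single-quoted, "/" delimited form).
const TAG_REGEX = '/<(?:(?:(?:(script|style|head|object|embed|applet|noframes|noscript|noembed)(?:\s+(?>"[\S\s]*?"|\'[\S\s]*?\'|(?:(?!\/>)[^>])?)+)?\s*>)[\S\s]*?<\/\1\s*(?=>))|(?:\/?[\w:]+\s*\/?)|(?:[\w:]+\s+(?:"[\S\s]*?"|\'[\S\s]*?\'|[^>]?)+\s*\/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>/';

// Hypothetical helper: delete every match of the pattern.
function strip_markup(string $html): string
{
    $text = preg_replace(TAG_REGEX, '', $html);
    if ($text === null) {
        // preg_replace() returns null when PCRE reports an error.
        throw new RuntimeException('PCRE error code ' . preg_last_error());
    }
    return $text;
}

// Closing tag present: the element is removed together with its content.
echo strip_markup('a<script type="text/javascript">if (x < 1) { go(); }</script>b'), "\n"; // ab

// No closing tag: only the tag itself is removed.
echo strip_markup('a<noscript>b'), "\n"; // ab

// A ">" inside a quoted attribute value does not end the tag early.
echo strip_markup('<a title="x > y">link</a>'), "\n"; // link
```

Because no modifiers are applied, the match is case-sensitive: an upper-case `<SCRIPT>` element would lose only its tags, not its content. Adding the `i` modifier is one way to cover that case.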

    Find Raw Regex

    <(?:(?:(?:(script|style|head|object|embed|applet|noframes|noscript|noembed)(?:\s+(?>"[\S\s]*?"|'[\S\s]*?'|(?:(?!/>)[^>])?)+)?\s*>)[\S\s]*?</\1\s*(?=>))|(?:/?[\w:]+\s*/?)|(?:[\w:]+\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]?)+\s*/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>  
    

    Replace with nothing.
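
To connect this back to the question, here is a rough sketch of how the replacement could feed the ratio calculation. It assumes the page has already been fetched into a string. The names `STRIP_REGEX` and `text_to_html_ratio()` are hypothetical, and the whitespace-collapsing step is an added assumption, not something the answer prescribes.

```php
<?php
// Pattern copied verbatim from the answer (single-quoted, "/" delimited form).
const STRIP_REGEX = '/<(?:(?:(?:(script|style|head|object|embed|applet|noframes|noscript|noembed)(?:\s+(?>"[\S\s]*?"|\'[\S\s]*?\'|(?:(?!\/>)[^>])?)+)?\s*>)[\S\s]*?<\/\1\s*(?=>))|(?:\/?[\w:]+\s*\/?)|(?:[\w:]+\s+(?:"[\S\s]*?"|\'[\S\s]*?\'|[^>]?)+\s*\/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>/';

// Hypothetical helper: percentage of the page's bytes that survive stripping.
function text_to_html_ratio(string $html): float
{
    if ($html === '') {
        return 0.0; // avoid dividing by zero on an empty page
    }
    $text = preg_replace(STRIP_REGEX, '', $html);
    if ($text === null) {
        throw new RuntimeException('PCRE error code ' . preg_last_error());
    }
    // Assumption: collapse whitespace runs so indentation is not counted as text.
    $text = trim((string) preg_replace('/\s+/', ' ', $text));
    return round(strlen($text) / strlen($html) * 100, 2);
}

// Only "Hello" survives here; the head element goes away with its content.
$page = '<html><head><title>T</title></head><body><p>Hello</p></body></html>';
echo text_to_html_ratio($page), "%\n";
```

Both lengths are measured in bytes with `strlen()`, as in the question's `check_ratio()`.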


    Various stringed / delimited representations

    Delimiter only:  /<(?:(?:(?:(script|style|head|object|embed|applet|noframes|noscript|noembed)(?:\s+(?>"[\S\s]*?"|'[\S\s]*?'|(?:(?!\/>)[^>])?)+)?\s*>)[\S\s]*?<\/\1\s*(?=>))|(?:\/?[\w:]+\s*\/?)|(?:[\w:]+\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]?)+\s*\/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>/
    Single Quote & Delimiter:  '/<(?:(?:(?:(script|style|head|object|embed|applet|noframes|noscript|noembed)(?:\s+(?>"[\S\s]*?"|\'[\S\s]*?\'|(?:(?!\/>)[^>])?)+)?\s*>)[\S\s]*?<\/\1\s*(?=>))|(?:\/?[\w:]+\s*\/?)|(?:[\w:]+\s+(?:"[\S\s]*?"|\'[\S\s]*?\'|[^>]?)+\s*\/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>/'
    Double Quote only:  "<(?:(?:(?:(script|style|head|object|embed|applet|noframes|noscript|noembed)(?:\\s+(?>\"[\\S\\s]*?\"|'[\\S\\s]*?'|(?:(?!/>)[^>])?)+)?\\s*>)[\\S\\s]*?</\\1\\s*(?=>))|(?:/?[\\w:]+\\s*/?)|(?:[\\w:]+\\s+(?:\"[\\S\\s]*?\"|'[\\S\\s]*?'|[^>]?)+\\s*/?)|\\?[\\S\\s]*?\\?|(?:!(?:(?:DOCTYPE[\\S\\s]*?)|(?:\\[CDATA\\[[\\S\\s]*?\\]\\])|(?:--[\\S\\s]*?--)|(?:ATTLIST[\\S\\s]*?)|(?:ENTITY[\\S\\s]*?)|(?:ELEMENT[\\S\\s]*?))))>"
    

    Expanded

     # <(?:(?:(?:(script|style|head|object|embed|applet|noframes|noscript|noembed)(?:\s+(?>"[\S\s]*?"|'[\S\s]*?'|(?:(?!/>)[^>])?)+)?\s*>)[\S\s]*?</\1\s*(?=>))|(?:/?[\w:]+\s*/?)|(?:[\w:]+\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]?)+\s*/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>
    
     <
     (?:
          (?:
               (?:
                    # Invisible content; end tag req'd
                    (                             # (1 start)
                         script
                      |  style
                      |  head
                      |  object
                      |  embed
                      |  applet
                      |  noframes
                      |  noscript
                      |  noembed 
                    )                             # (1 end)
                    (?:
                         \s+ 
                         (?>
                              " [\S\s]*? "
                           |  ' [\S\s]*? '
                           |  (?:
                                   (?! /> )
                                   [^>] 
                              )?
                         )+
                    )?
                    \s* >
               )
    
               [\S\s]*? </ \1 \s* 
               (?= > )
          )
    
       |  (?: /? [\w:]+ \s* /? )
       |  (?:
               [\w:]+ 
               \s+ 
               (?:
                    " [\S\s]*? " 
                 |  ' [\S\s]*? ' 
                 |  [^>]? 
               )+
               \s* /?
          )
       |  \? [\S\s]*? \?
       |  (?:
               !
               (?:
                    (?: DOCTYPE [\S\s]*? )
                 |  (?: \[CDATA\[ [\S\s]*? \]\] )
                 |  (?: -- [\S\s]*? -- )
                 |  (?: ATTLIST [\S\s]*? )
                 |  (?: ENTITY [\S\s]*? )
                 |  (?: ELEMENT [\S\s]*? )
               )
          )
     )
     >
    
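
The expanded layout can also be used as-is in PHP. The sketch below assumes the `x` (extended) modifier, which makes PCRE ignore unescaped whitespace and `#` comments outside character classes, and a nowdoc string, so that no backslash or quote needs escaping. The `~` delimiter and the helper name `strip_expanded()` are choices made for this example only.

```php
<?php
// Expanded regex from the answer, wrapped in "~" delimiters with the x modifier.
const EXPANDED_REGEX = <<<'REGEX'
~
 <
 (?:
      (?:
           (?:
                # Invisible content; end tag req'd
                (                             # (1 start)
                     script
                  |  style
                  |  head
                  |  object
                  |  embed
                  |  applet
                  |  noframes
                  |  noscript
                  |  noembed
                )                             # (1 end)
                (?:
                     \s+
                     (?>
                          " [\S\s]*? "
                       |  ' [\S\s]*? '
                       |  (?:
                               (?! /> )
                               [^>]
                          )?
                     )+
                )?
                \s* >
           )
           [\S\s]*? </ \1 \s*
           (?= > )
      )
   |  (?: /? [\w:]+ \s* /? )
   |  (?:
           [\w:]+
           \s+
           (?:
                " [\S\s]*? "
             |  ' [\S\s]*? '
             |  [^>]?
           )+
           \s* /?
      )
   |  \? [\S\s]*? \?
   |  (?:
           !
           (?:
                (?: DOCTYPE [\S\s]*? )
             |  (?: \[CDATA\[ [\S\s]*? \]\] )
             |  (?: -- [\S\s]*? -- )
             |  (?: ATTLIST [\S\s]*? )
             |  (?: ENTITY [\S\s]*? )
             |  (?: ELEMENT [\S\s]*? )
           )
      )
 )
 >
~x
REGEX;

// Hypothetical helper: delete every match of the expanded pattern.
function strip_expanded(string $html): string
{
    return (string) preg_replace(EXPANDED_REGEX, '', $html);
}

$html = '<!-- note --><p class="a">Hi <b>there</b></p><style>p { color: red }</style>';
echo strip_expanded($html), "\n"; // Hi there
```

Since `/` is no longer the delimiter, the slashes in the pattern stay unescaped, exactly as in the raw regex.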

    Benchmark:

    Regex1:   <(?:(?:(?:(script|style|head|object|embed|applet|noframes|noscript|noembed)(?:\s+(?>"[\S\s]*?"|'[\S\s]*?'|(?:(?!/>)[^>])?)+)?\s*>)[\S\s]*?</\1\s*(?=>))|(?:/?[\w:]+\s*/?)|(?:[\w:]+\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]?)+\s*/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>
    Options:  < none >
    Completed iterations:   3  /  3     ( x 1000 )
    Matches found per iteration:   3780
    Elapsed Time:    43.52 s,   43523.08 ms,   43523084 µs
    

    Sample Analysis, page size 126,000 bytes:

          3,780 tags / page
      x   3,000 iterations
    --------------------------
     11,340,000 total tags
      /  43.52  seconds
    --------------------------
        260,569 tags / second  
      /   3,780 tags / page
    --------------------------
     ≈ 69 pages / second
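
The arithmetic in the sample analysis can be reproduced in a few lines of PHP. This is only a sketch: `throughput()` and `time_strip()` are hypothetical helpers, and `time_strip()` is a generic timing loop, not the tool that produced the figures above, so timings measured in PHP will differ.

```php
<?php
// Hypothetical helper: derive throughput figures from raw benchmark numbers.
function throughput(int $tagsPerPage, int $iterations, float $seconds): array
{
    $totalTags = $tagsPerPage * $iterations;
    return [
        'total_tags'       => $totalTags,
        'tags_per_second'  => $totalTags / $seconds,
        'pages_per_second' => $iterations / $seconds, // one page per iteration
    ];
}

// Hypothetical timing loop: run the replacement $iterations times over one page
// and return the elapsed wall-clock time in seconds.
function time_strip(string $pattern, string $html, int $iterations): float
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        preg_replace($pattern, '', $html);
    }
    return microtime(true) - $start;
}

// Figures quoted in the benchmark: 3,780 tags per page, 3,000 iterations, 43.52 s.
$r = throughput(3780, 3000, 43.52);
printf(
    "%d total tags, %.0f tags/s, %.2f pages/s\n",
    $r['total_tags'],
    $r['tags_per_second'],
    $r['pages_per_second']
);
```

To time the regex on a page of your own, pass any delimited form of the pattern and the page's HTML to `time_strip()`, then feed the elapsed seconds into `throughput()`.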