I am trying to get Text to HTML Ratio on a given webpage. I am using a strip_html_tags
to strip out the html tags and comparing it to the original content on the page to get the ratio. My issue is that I feel like my strip_html_tags
function may not get all the tags on webpage. Is there a better way to do this... maybe that just replaces everything that starts with < and >. I can already point out that I am missing a lot of tags that should be stripped in the regex but there has to be a better way to do all this.
function strip_html_tags($text)
{
$text = preg_replace(array(
'@<head[^>]*?>.*?</head>@siu',
'@<style[^>]*?>.*?</style>@siu',
'@<script[^>]*?.*?</script>@siu',
'@<object[^>]*?.*?</object>@siu',
'@<embed[^>]*?.*?</embed>@siu',
'@<applet[^>]*?.*?</applet>@siu',
'@<noframes[^>]*?.*?</noframes>@siu',
'@<noscript[^>]*?.*?</noscript>@siu',
'@<noembed[^>]*?.*?</noembed>@siu',
'@</?((address)|(blockquote)|(center)|(del))@iu',
'@</?((div)|(h[1-9])|(ins)|(isindex)|(p)|(pre))@iu',
'@</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))@iu',
'@</?((table)|(th)|(td)|(caption))@iu',
'@</?((form)|(button)|(fieldset)|(legend)|(input))@iu',
'@</?((label)|(select)|(optgroup)|(option)|(textarea))@iu',
'@</?((frameset)|(frame)|(iframe))@iu',
'#<[\/\!]*?[^<>]*?>#siu', // Strip out HTML tags
'#<![\s\S]*?--[ \t\n\r]*>#siu' // Strip multi-line comments including CDATA
), array(
' ',
' ',
' ',
' ',
' ',
' ',
' ',
' ',
' ',
"\n\$0",
"\n\$0",
"\n\$0",
"\n\$0",
"\n\$0",
"\n\$0",
"\n\$0",
"\n\$0"
), $text);
return strip_tags($text);
}
function check_ratio($url)
{
$file_content = // getting data from curl request here
$page_size = mb_strlen($file_content, '8bit');
$content = strip_html_tags($file_content);
$text_size = mb_strlen($content, '8bit');
$content = preg_replace("/(^[\r\n]*|[\r\n]+)[\s\t]*[\r\n]+/", " ", $content);
$len_real = strlen($file_content);
$len_strip = strlen($content);
return round((($len_strip / $len_real) * 100), 2);
}
This is using a regex.
Update 1:
-Have to add an atomic group around the tag body of invisible content,
or could cause catastrophic backtracking if quotes are unbalanced.
-Added list of invisible content it will remove:
script, style, head, object, embed, applet, noframes, noscript, noembed
If no closing tag, just the tag will be removed, otherwise it's content is removed with the tags.
Find Raw Regex
<(?:(?:(?:(script|style|head|object|embed|applet|noframes|noscript|noembed)(?:\s+(?>"[\S\s]*?"|'[\S\s]*?'|(?:(?!/>)[^>])?)+)?\s*>)[\S\s]*?</\1\s*(?=>))|(?:/?[\w:]+\s*/?)|(?:[\w:]+\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]?)+\s*/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>
Replace with nothing.
Various stringed / delimited representations
Delimiter only: /<(?:(?:(?:(script|style|head|object|embed|applet|noframes|noscript|noembed)(?:\s+(?>"[\S\s]*?"|'[\S\s]*?'|(?:(?!\/>)[^>])?)+)?\s*>)[\S\s]*?<\/\1\s*(?=>))|(?:\/?[\w:]+\s*\/?)|(?:[\w:]+\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]?)+\s*\/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>/
Single Quote & Delimiter: '/<(?:(?:(?:(script|style|head|object|embed|applet|noframes|noscript|noembed)(?:\s+(?>"[\S\s]*?"|\'[\S\s]*?\'|(?:(?!\/>)[^>])?)+)?\s*>)[\S\s]*?<\/\1\s*(?=>))|(?:\/?[\w:]+\s*\/?)|(?:[\w:]+\s+(?:"[\S\s]*?"|\'[\S\s]*?\'|[^>]?)+\s*\/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>/'
Double Quote only: "<(?:(?:(?:(script|style|head|object|embed|applet|noframes|noscript|noembed)(?:\\s+(?>\"[\\S\\s]*?\"|'[\\S\\s]*?'|(?:(?!/>)[^>])?)+)?\\s*>)[\\S\\s]*?</\\1\\s*(?=>))|(?:/?[\\w:]+\\s*/?)|(?:[\\w:]+\\s+(?:\"[\\S\\s]*?\"|'[\\S\\s]*?'|[^>]?)+\\s*/?)|\\?[\\S\\s]*?\\?|(?:!(?:(?:DOCTYPE[\\S\\s]*?)|(?:\\[CDATA\\[[\\S\\s]*?\\]\\])|(?:--[\\S\\s]*?--)|(?:ATTLIST[\\S\\s]*?)|(?:ENTITY[\\S\\s]*?)|(?:ELEMENT[\\S\\s]*?))))>"
Expanded
# <(?:(?:(?:(script|style|head|object|embed|applet|noframes|noscript|noembed)(?:\s+(?>"[\S\s]*?"|'[\S\s]*?'|(?:(?!/>)[^>])?)+)?\s*>)[\S\s]*?</\1\s*(?=>))|(?:/?[\w:]+\s*/?)|(?:[\w:]+\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]?)+\s*/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>
<
(?:
(?:
(?:
# Invisible content; end tag req'd
( # (1 start)
script
| style
| head
| object
| embed
| applet
| noframes
| noscript
| noembed
) # (1 end)
(?:
\s+
(?>
" [\S\s]*? "
| ' [\S\s]*? '
| (?:
(?! /> )
[^>]
)?
)+
)?
\s* >
)
[\S\s]*? </ \1 \s*
(?= > )
)
| (?: /? [\w:]+ \s* /? )
| (?:
[\w:]+
\s+
(?:
" [\S\s]*? "
| ' [\S\s]*? '
| [^>]?
)+
\s* /?
)
| \? [\S\s]*? \?
| (?:
!
(?:
(?: DOCTYPE [\S\s]*? )
| (?: \[CDATA\[ [\S\s]*? \]\] )
| (?: -- [\S\s]*? -- )
| (?: ATTLIST [\S\s]*? )
| (?: ENTITY [\S\s]*? )
| (?: ELEMENT [\S\s]*? )
)
)
)
>
Benchmark:
Regex1: <(?:(?:(?:(script|style|head|object|embed|applet|noframes|noscript|noembed)(?:\s+(?>"[\S\s]*?"|'[\S\s]*?'|(?:(?!/>)[^>])?)+)?\s*>)[\S\s]*?</\1\s*(?=>))|(?:/?[\w:]+\s*/?)|(?:[\w:]+\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]?)+\s*/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>
Options: < none >
Completed iterations: 3 / 3 ( x 1000 )
Matches found per iteration: 3780
Elapsed Time: 43.52 s, 43523.08 ms, 43523084 µs
Sample Analysis, page size 126,000 bytes
:
3,780 tags / page
x 3,000 iterations
--------------------------
11,340,000 total tags
/ 43.52 seconds
--------------------------
260,569 tags / second
/ 3,780 tags / page
--------------------------
70 pages / second