Character count, including markup, for the first 20 visible words of an HTML string

I am working on a WordPress site where one of the pages lists excerpts about corporate clients.

Let's say I have a web page where the visible text looks like this:

"SuperAmazing.com, a subsidiary of Amazing, the leading provider of
integrated messaging and collaboration services, today announced the
availability of an enhanced version of its Enterprise Messaging
Service (CMS) 2.0, a lower cost webmail alternative to other business
email solutions such as Microsoft Exchange, GroupWise and LotusNotes
offerings."

But let's say there can be an HTML link or image in this text, so the raw HTML might look like this:

<img src="/images/corporate/logos/super_amazing.jpg" alt="Company
logo for SuperAmazing.com" /> SuperAmazing.com, a subsidiary of
<a href="http://www.amazing.com/">Amazing</a>, the leading
provider of integrated messaging and collaboration services, today
announced the availability of an enhanced version of its Enterprise
Messaging Service (CMS) 2.0, a lower cost webmail alternative to other
business email solutions such as Microsoft Exchange, GroupWise and
LotusNotes offerings.

Here is what I need to do: find out if there is a link inside of the first 20 visible words.

These are the first 20 visible words:

"SuperAmazing.com, a subsidiary of Amazing, the leading provider of
integrated messaging and collaboration services, today announced the
availability of an"

I need to get the character count, including the HTML, up to the 20th visible word, which in this case would be "an", though of course it'll be different for each excerpt on the page.

(I'm willing to count "SuperAmazing.com" as 2 words if that makes things easier.)

I tried a number of regular expressions for counting words, but they all count the HTML, not the visible words.

So what would be the correct regular expression for finding the full character count, including the HTML, for the first 20 visible words?

Solution

Here's a reasonably good regex for matching the first twenty visible words:

'~^(?:\s*+(?:(?:[^<>\s]++|</?\w[^<>]*+>)++)){1,20}~'

This matches one to twenty whitespace-separated tokens, where a token is defined as one or more words or tags not separated by whitespace (where a "word" is defined as one or more characters other than whitespace or angle brackets). For example, this would be one token:

<a href="http://www.amazing.com/">Amazing</a>

...but this is two tokens:

<a href="http://www.superduper.com/">Super Duper</a>
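To see the token definition in action, here is a minimal sketch in Python rather than PHP. The names `TOKEN` and `tokens` are made up for illustration, and the pattern is only the inner "token" part of the regex above, written with ordinary greedy quantifiers instead of possessive ones.

```python
import re

# One token: a run of "words" and tags with no whitespace between them.
# A tag may contain whitespace internally (e.g. between attributes).
TOKEN = re.compile(r'(?:[^<>\s]+|</?\w[^<>]*>)+')

def tokens(html):
    """Split an HTML fragment into whitespace-separated tokens."""
    return TOKEN.findall(html)

one = '<a href="http://www.amazing.com/">Amazing</a>'
two = '<a href="http://www.superduper.com/">Super Duper</a>'

print(len(tokens(one)), tokens(one))
print(len(tokens(two)), tokens(two))
```

The second link splits at the space between "Super" and "Duper", so the opening tag travels with the first word and the closing tag with the second.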

This will treat a standalone tag (like the <img> tag in your example, or any tag that's surrounded by whitespace) as a separate token, which throws off the count--it only matches up to the word "of" in your example. It also won't correctly handle <br> tags, or block-level tags like <p> and <table>, if they don't have any whitespace around them. Only you can know how much of a problem that will be.
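As a rough sketch of how the match could be used, the snippet below applies a Python port of the pattern to the example from the question, takes the length of the matched prefix as the character count, and then looks for an opening `<a` tag inside that prefix. The names `FIRST_20_TOKENS`, `LINK`, `prefix`, `char_count` and `has_link` are illustrative, and the port drops the possessive quantifiers because Python's `re` only supports them from version 3.11.

```python
import re

# Python port of the pattern above (ordinary greedy quantifiers).
FIRST_20_TOKENS = re.compile(r'^(?:\s*(?:[^<>\s]+|</?\w[^<>]*>)+){1,20}')
# An opening anchor tag; \b keeps it from matching tags such as <abbr>.
LINK = re.compile(r'<a\b', re.IGNORECASE)

html = (
    '<img src="/images/corporate/logos/super_amazing.jpg" '
    'alt="Company logo for SuperAmazing.com" /> SuperAmazing.com, a subsidiary of '
    '<a href="http://www.amazing.com/">Amazing</a>, the leading provider of '
    'integrated messaging and collaboration services, today announced the '
    'availability of an enhanced version of its Enterprise Messaging Service'
)

match = FIRST_20_TOKENS.match(html)
prefix = match.group(0) if match else ''

char_count = len(prefix)                     # characters, markup included
has_link = LINK.search(prefix) is not None   # is there a link in the prefix?

print(char_count, has_link)
print(repr(prefix[-20:]))                    # shows where the match stopped
```

Because the standalone `<img>` tag is counted as a token, the prefix here stops at "availability of", one visible word short of the target.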

EDIT: If that isolated <img> tag is something you see a lot, you could preprocess the text to remove the whitespace following it. That would effectively merge it with the first subsequent "real" token, resulting in a more accurate character count. I know it only changes the count by one or two characters in this case, but if the twentieth word happened to be "supercalifragilisticexpialidocious" you'd probably notice the difference. :)
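The preprocessing step might look something like this. It is a sketch in Python, not PHP; `IMG_THEN_SPACE` and `first_20_prefix` are hypothetical names, and it assumes `<img>` is the only standalone tag that needs merging.

```python
import re

# Python port of the pattern above, written with ordinary greedy quantifiers.
FIRST_20_TOKENS = re.compile(r'^(?:\s*(?:[^<>\s]+|</?\w[^<>]*>)+){1,20}')
# An <img> tag followed by whitespace; only the whitespace is removed.
IMG_THEN_SPACE = re.compile(r'(<img\b[^<>]*>)\s+', re.IGNORECASE)

def first_20_prefix(html):
    """Return the leading HTML covering up to 20 tokens, after gluing each
    <img> tag to the token that follows it."""
    merged = IMG_THEN_SPACE.sub(r'\1', html)
    match = FIRST_20_TOKENS.match(merged)
    return match.group(0) if match else ''

sample = (
    '<img src="/images/corporate/logos/super_amazing.jpg" '
    'alt="Company logo for SuperAmazing.com" /> SuperAmazing.com, a subsidiary of '
    '<a href="http://www.amazing.com/">Amazing</a>, the leading provider of '
    'integrated messaging and collaboration services, today announced the '
    'availability of an enhanced version of its Enterprise Messaging Service'
)

prefix = first_20_prefix(sample)
print(len(prefix), repr(prefix[-18:]))
```

Note that the length is measured on the preprocessed string, so if the count is used as an offset into the original HTML, the removed whitespace has to be added back.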