Search code examples
htmlregexdomxhtmldom-traversal

How would you abbriviate XHTML to an arbitrary number of words?


How would you programmacially abbreviate XHTML to an arbitrary number of words without leaving unclosed or corrupted tags?

i.e.

<p>
    Proin tristique dapibus neque. Nam eget purus sit amet leo
    tincidunt accumsan.
</p>
<p>
    Proin semper, orci at mattis blandit, augue justo blandit nulla.
    <span>Quisque ante congue justo</span>, ultrices aliquet, mattis eget,
    hendrerit, <em>justo</em>.
</p>

Abbreviated to 25 words would be:

<p>
    Proin tristique dapibus neque. Nam eget purus sit amet leo
    tincidunt accumsan.
</p>
<p>
    Proin semper, orci at mattis blandit, augue justo blandit nulla.
    <span>Quisque ante congue...</span>
</p>

Solution

  • Recurse through the DOM tree, keeping a word count variable up to date. When the word count exceeds your maximum word count, insert "..." and remove all following siblings of the current node, then, as you go back up through the recursion, remove all the following siblings of each of its ancestors.