php, html, string, tags, php-5.3

Truncate string with HTML tags in it


I have a string that contains HTML tags. I'm looking for a piece of code that would let me truncate this string so that the result:

  • is 100 characters long,
  • contains no image tags (<img />),
  • keeps all other HTML tags,
  • does not count whitespace or HTML tag characters toward the 100-character length.

For example, the string is:

<img>Something</img><b>Just an Example</b> Plain Text <br><a href="#">stackoverflow</a>

So the result should be:

Just an Example Plain Text stackoverflow (it's a link).

As a result we have around 35 characters (excluding whitespace).
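
A quick way to check that figure is to strip the markup and whitespace and count what is left. This is only a sketch: `countVisibleChars` is a hypothetical helper, and it assumes that an `<img>...</img>` pair is dropped together with its content, as in the expected result above.

```php
<?php
// Hypothetical helper: counts characters that are neither markup nor whitespace.
function countVisibleChars($html)
{
    // Assumption: an <img>...</img> pair is removed together with its content;
    // a standalone <img ...> tag is removed on its own.
    $html = preg_replace('~<img\b[^>]*>(?:.*?</img>)?~is', '', $html);
    // Remove the remaining tags but keep their text content.
    $text = strip_tags($html);
    // Decode entities so that "&amp;" counts as a single character.
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
    // Drop all whitespace before counting.
    $text = preg_replace('~\s+~u', '', $text);
    return mb_strlen($text, 'UTF-8');
}

$html = '<img>Something</img><b>Just an Example</b> Plain Text <br><a href="#">stackoverflow</a>';
echo countVisibleChars($html); // 35
```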

I tried the solution from this question, but didn't get the required result. Any help would be appreciated.


Solution

  • How about a function? Here's mine: AbstractHTMLContents. It takes two parameters:

    • the input HTML content,
    • the maximum length in characters (100 by default).

    Here's the code:

    function AbstractHTMLContents($html, $maxLength = 100)
    {
        mb_internal_encoding("UTF-8");
        $printedLength = 0;  // characters of text emitted so far (whitespace included)
        $position = 0;       // byte offset into $html
        $tags = array();     // stack of currently open tags
        $newContent = '';
        // Elements that never take a closing tag.
        $voidTags = array('br', 'hr', 'input', 'meta', 'link', 'wbr');

        // Remove image tags, including a bare <img>.
        $html = preg_replace('/<img\b[^>]*>/i', '', $html);

        while ($printedLength < $maxLength
            && preg_match('{</?([a-z][a-z0-9]*)[^>]*>|&#?[a-zA-Z0-9]+;}i', $html, $match, PREG_OFFSET_CAPTURE, $position)
        ) {
            list($tag, $tagPosition) = $match[0];

            // Print the text leading up to the tag. preg_match offsets are in bytes.
            $str = substr($html, $position, $tagPosition - $position);
            if ($printedLength + mb_strlen($str) > $maxLength) {
                // Cut by characters, then drop the trailing partial word.
                $newstr = mb_substr($str, 0, $maxLength - $printedLength);
                $newstr = preg_replace('~\s+\S+$~', '', $newstr);
                $newContent .= $newstr;
                $printedLength = $maxLength;
                break;
            }
            $newContent .= $str;
            $printedLength += mb_strlen($str);

            if ($tag[0] == '&') {
                // An entity counts as one character.
                if ($printedLength >= $maxLength) {
                    break;
                }
                $newContent .= $tag;
                $printedLength++;
            } else {
                $tagName = strtolower($match[1][0]);
                if ($tag[1] == '/') {
                    // Closing tag: emit it only if it matches the last opened tag.
                    if (end($tags) === $tagName) {
                        array_pop($tags);
                        $newContent .= $tag;
                    }
                } else {
                    $newContent .= $tag;
                    // Self-closing and void tags are not pushed onto the stack.
                    $selfClosing = substr($tag, -2, 1) == '/';
                    if (!$selfClosing && !in_array($tagName, $voidTags)) {
                        $tags[] = $tagName;
                    }
                }
            }

            // Continue after the tag (byte offset, so strlen rather than mb_strlen).
            $position = $tagPosition + strlen($tag);
        }

        // Print any remaining text.
        if ($printedLength < $maxLength && $position < strlen($html)) {
            $rest = substr($html, $position);
            $newstr = mb_substr($rest, 0, $maxLength - $printedLength);
            // Drop a trailing partial word only if the text was actually cut.
            if (mb_strlen($rest) > $maxLength - $printedLength) {
                $newstr = preg_replace('~\s+\S+$~', '', $newstr);
            }
            $newContent .= $newstr;
        }

        // Close any tags that are still open.
        while (!empty($tags)) {
            $newContent .= sprintf('</%s>', array_pop($tags));
        }

        return $newContent;
    }
    

    It seems to give the result you expect.
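
The function above counts whitespace toward the limit. If whitespace should be excluded, as the question asks, the text-cutting step could be swapped for something along these lines. This is a minimal sketch, not part of the original answer: `cutByVisibleChars` is a hypothetical helper, and it assumes the input is valid UTF-8 plain text (no tags).

```php
<?php
// Hypothetical helper: returns the longest prefix of $text that contains
// at most $limit non-whitespace characters. Whitespace is kept in the
// output but does not count toward the limit.
function cutByVisibleChars($text, $limit)
{
    $out = '';
    $count = 0;
    // Split into single UTF-8 characters (assumes valid UTF-8 input).
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    foreach ($chars as $char) {
        $isSpace = preg_match('/^\s$/u', $char) === 1;
        if (!$isSpace) {
            if ($count >= $limit) {
                break; // limit reached: stop before the next visible character
            }
            $count++;
        }
        $out .= $char;
    }
    // Remove whitespace left dangling at the cut point.
    return rtrim($out);
}

echo cutByVisibleChars('Just an Example Plain Text', 13); // Just an Example
```

Counting while walking the characters keeps the spaces between words in the output, which a plain "strip whitespace, then cut" approach would lose.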