Tags: python, regex, string, replace, python-re

How to replace text between multiple tags based on character length


I am dealing with dirty text data (and not with valid html). I am doing natural language processing and short code snippets shouldn't be removed because they can contain valuable information while long code snippets don't.

That's why I would like to remove the text between code tags (together with the tags) only if that text is longer than n characters.

Let's say n = 5, so at most 5 characters are allowed between two code tags. Then every code snippet longer than 5 characters will be removed.

My approach so far deletes all of the code characters:

import re
text = "This is a string <code>1234</code> another string <code>123</code> another string <code>123456789</code> another string."
text = re.sub("<code>.*?</code>", '', text)
print(text)

Output: This is a string  another string  another string  another string.

The desired output:

"This is a string <code>1234</code> another string <code>123</code> another string another string."

Is there a way to count the text length inside each <code>...</code> pair before it is actually removed?


Solution

  • In Python, BeautifulSoup is often used to manipulate HTML/XML contents. If you use this library, you can use something like

    from bs4 import BeautifulSoup
    text = "This is a string <code>1234</code> another string <code>123</code> another string <code>123456789</code> another string."
    soup = BeautifulSoup(text,"html.parser")
    for code in soup.find_all("code"):
        if len(code.encode_contents()) > 5: # Check the inner HTML length
            code.extract()                  # Remove the node found
    
    print(str(soup))
    # => This is a string <code>1234</code> another string <code>123</code> another string  another string.
    

    Note that encode_contents() returns the inner HTML as bytes, so this check measures the length of the inner HTML (including any nested tags), not the length of the inner text.
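
If the limit should apply to the visible text rather than to the inner HTML, one option is to measure `get_text()` instead. This is only a minimal sketch, assuming the same n = 5 threshold; the sample string and the `MAX_LEN` name are made up for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the first snippet has 2 visible characters
# but 9 characters of inner HTML because of the nested <b> tag.
text = "Keep <code>a<b>c</b></code> but drop <code>123456789</code> here."
soup = BeautifulSoup(text, "html.parser")

MAX_LEN = 5  # assumed threshold, matching n = 5 from the question

for code in soup.find_all("code"):
    # get_text() returns only the text content, ignoring nested tags
    if len(code.get_text()) > MAX_LEN:
        code.decompose()  # remove the tag and everything inside it

result = str(soup)
print(result)
```

Here the first snippet is kept because its text is only two characters long, whereas a check on the inner HTML length would have removed it.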

    With regex, you can use a negated character class pattern, [^<], to match any char other than <, and apply a limiting quantifier to it. To remove every snippet longer than 5 chars, use the {6,} quantifier:

    import re
    text = "This is a string <code>1234</code> another string <code>123</code> another string <code>123456789</code> another string."
    text = re.sub(r'<code>[^<]{6,}</code>', '', text)
    print(text)
    # => This is a string <code>1234</code> another string <code>123</code> another string  another string.
    

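
The regex approach can also be written with a replacement function, which counts the length of each match explicitly before deciding whether to remove it. Unlike a negated character class, it still works when a snippet itself contains `<` or `>`. This is a minimal sketch; `strip_long_code` and `max_len` are hypothetical names, not part of the original answer:

```python
import re

def strip_long_code(text, max_len=5):
    """Remove <code>...</code> spans whose inner content exceeds max_len characters."""
    def replace(match):
        inner = match.group(1)
        # Keep the whole match when it is short enough, otherwise drop it
        return match.group(0) if len(inner) <= max_len else ""

    # re.DOTALL lets '.' match newlines inside multi-line snippets
    return re.sub(r"<code>(.*?)</code>", replace, text, flags=re.DOTALL)

text = "This is a string <code>1234</code> another string <code>123</code> another string <code>123456789</code> another string."
result = strip_long_code(text)
print(result)
# => This is a string <code>1234</code> another string <code>123</code> another string  another string.
```

Passing a function as the replacement argument of `re.sub` makes the threshold an ordinary parameter, so n can be changed without rewriting the pattern.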