How to replace text between multiple tags based on character length

I am dealing with dirty text data (not valid HTML). I am doing natural language processing, and short code snippets shouldn't be removed because they can contain valuable information, while long code snippets don't.

That's why I would like to remove the text between code tags only if the content to be removed is longer than n characters.

Let's say the maximum number of characters allowed between two code tags is n = 5. Then everything between those tags that is longer than 5 characters will be removed.

My approach so far deletes all of the code characters:

import re

text = "This is a string <code>1234</code> another string <code>123</code> another string <code>123456789</code> another string."
text = re.sub("<code>.*?</code>", '', text)
print(text)

Output: This is a string  another string  another string  another string.

The desired output:

"This is a string <code>1234</code> another string <code>123</code> another string another string."

Is there a way to check the text length inside each <code>...</code> pair before it is actually removed?

Solution

In Python, BeautifulSoup is often used to manipulate HTML/XML contents. If you use this library, you can use something like

from bs4 import BeautifulSoup

text = "This is a string <code>1234</code> another string <code>123</code> another string <code>123456789</code> another string."
soup = BeautifulSoup(text, "html.parser")
for code in soup.find_all("code"):
    # decode_contents() returns the inner HTML as a str, so len() counts characters
    # (encode_contents() returns bytes, which miscounts non-ASCII text)
    if len(code.decode_contents()) > 5:
        code.extract()  # Remove the node from the tree

print(str(soup))
# => This is a string <code>1234</code> another string <code>123</code> another string  another string.

Note that this measures the length of the inner HTML (any nested markup included), not just the inner text. To count only the text, use len(code.get_text()) instead.

With regex, you can use a negated character class, [^<], to match any character other than <, and apply a limiting quantifier to it. To remove everything longer than 5 characters, use the {6,} quantifier. Note that this only matches code content that contains no < character:

import re
text = "This is a string <code>1234</code> another string <code>123</code> another string <code>123456789</code> another string."
text = re.sub(r'<code>[^<]{6,}</code>', '', text)
print(text)
# => This is a string <code>1234</code> another string <code>123</code> another string  another string.

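If the limit n needs to be configurable, or the code content may itself contain a < character, one alternative is to pass a replacement function to re.sub and do the length check in Python. The following is only a minimal sketch: the helper name strip_long_code and its max_len parameter are made up for illustration, and it assumes the tags are always written exactly as <code> and </code> and are never nested.

```python
import re

# Lazy match of everything between <code> and the nearest </code>.
# re.DOTALL lets the content span multiple lines.
CODE_RE = re.compile(r"<code>(.*?)</code>", re.DOTALL)

def strip_long_code(text, max_len=5):
    """Remove <code>...</code> spans whose inner content exceeds max_len characters."""
    def replace(match):
        inner = match.group(1)
        # Keep short snippets untouched, drop long ones entirely.
        return match.group(0) if len(inner) <= max_len else ""
    return CODE_RE.sub(replace, text)

text = "This is a string <code>1234</code> another string <code>123</code> another string <code>123456789</code> another string."
print(strip_long_code(text))
# => This is a string <code>1234</code> another string <code>123</code> another string  another string.

# Content containing "<" is handled too, unlike with the [^<] pattern.
print(strip_long_code("x <code>a < b and c</code> y"))
# => x  y
```

Because the check happens in the callback, the same function works for any n, and the condition can be changed (for example to count words instead of characters) without touching the pattern.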