python regex nlp data-cleaning

Separating a Python string by character while keeping inline tags intact


I'm trying to make a custom tokenizer in Python that works with inline tags. The goal is to take a string input like this:

'This is *tag1* a test *tag2*.'

and have it output a list separated by tag and character:

['T', 'h', 'i', 's', ' ', 'i', 's', ' ', '*tag1*', ' ',  'a', ' ', 't', 'e', 's', 't', ' ', '*tag2*', '.']

Without the tags, I would just use list(). I think I have found a solution for dealing with a single tag type, but there are multiple tag types. There are also other multi-character segments, such as ellipses, that are supposed to be encoded as a single feature.
One thing I tried is replacing each tag with a single unused character using a regex and then calling list() on the string:

import re

text = 'This is *tag1* a test *tag2*.'
# re.findall collects every tag; re.match would only try the start of the string
tags = re.findall(r'\*.*?\*', text)
text = re.sub(r'\*.*?\*', '#', text)
text = list(text)

I would then iterate over the list and replace each '#' with the extracted tags. However, I have multiple different features I am trying to extract, and repeating the process with a different placeholder character for each one before splitting the string seems like poor practice. Is there an easier way to do something like this? I'm still quite new to this, so there are a lot of common methods I am unaware of. I guess I could also use a larger regex that encompasses all of the features I'm trying to extract, but that still feels hacky, and I would prefer something more modular that can be used to find other features without writing a new expression every time.


Solution

  • You can use the following regex with re.findall:

    \*[^*]*\*|.
    

    See the regex demo. The re.S flag (also spelled re.DOTALL) can be used with this pattern so that . also matches line break chars, which it does not match by default.

    Details

    • \*[^*]*\* - a * char, followed by zero or more chars other than *, and then another * char
    • | - or
    • . - any single char (including line break chars when re.S is passed).
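
To illustrate the effect of the flag described above, here is a small sketch comparing the same pattern with and without re.S. The sample string and the variable names are made up for this illustration.

```python
import re

pattern = r'\*[^*]*\*|.'
sample = 'a\n*t*'  # hypothetical input containing a line break

# Without re.S, "." skips the newline, so that character is silently dropped
without_flag = re.findall(pattern, sample)

# With re.S, "." also matches the newline, so every character is kept
with_flag = re.findall(pattern, sample, re.S)

print(without_flag)  # ['a', '*t*']
print(with_flag)     # ['a', '\n', '*t*']
```

If the input can never contain line breaks, the flag makes no difference, but passing it should be harmless.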

    See the Python demo:

    import re
    s = 'This is *tag1* a test *tag2*.'
    print(re.findall(r'\*[^*]*\*|.', s, re.S))
    # => ['T', 'h', 'i', 's', ' ', 'i', 's', ' ', '*tag1*', ' ', 'a', ' ', 't', 'e', 's', 't', ' ', '*tag2*', '.']
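
Since the question asks for something more modular, one possible extension is to build the alternation from a list of feature patterns and append the single-character fallback at the end. This is only a sketch under assumptions: FEATURE_PATTERNS, build_tokenizer, and the three-dot ellipsis pattern are hypothetical names and choices for illustration, not part of the original answer.

```python
import re

# Hypothetical feature list: one regex per multi-character token type.
# Earlier entries win when two patterns could match at the same position.
FEATURE_PATTERNS = [
    r'\*[^*]*\*',  # inline tags such as *tag1*
    r'\.{3}',      # a three-dot ellipsis (assumed spelling of "ellipses")
]

def build_tokenizer(feature_patterns):
    """Return a function that splits text into feature tokens and single chars."""
    # Each pattern is wrapped in a non-capturing group so its own "|" stays contained;
    # the trailing "." is the fallback that matches any remaining single character.
    combined = '|'.join(f'(?:{p})' for p in feature_patterns) + '|.'
    regex = re.compile(combined, re.S)

    def tokenize(text):
        # finditer + group(0) returns whole matches even if a pattern has groups
        return [m.group(0) for m in regex.finditer(text)]

    return tokenize

tokenize = build_tokenizer(FEATURE_PATTERNS)
tokens = tokenize('Wait... *tag1* ok.')
print(tokens)
# ['W', 'a', 'i', 't', '...', ' ', '*tag1*', ' ', 'o', 'k', '.']
```

Adding a new feature then means appending one pattern to the list rather than rewriting the whole expression. Order matters: a more specific pattern should come before a more general one that could match at the same position.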