Tags: python, regex, nlp

Tokenize characters except when encapsulated by brackets and keep brackets


I am parsing my keystroke data. It looks something like this:

> key_data = 'stuff[up][left][return]end'

I want to tokenize the characters, but treat the modifiers surrounded by [] as a single token.

> print(key_tokens)
['s','t','u','f','f','[up]','[left]','[return]','e','n','d']

I know I can do something like this to find the encapsulated sections:

> key_tokens = re.split(r'([\[\]])', key_data)
> print(key_tokens)
['stuff','[','up',']','','[','left',']','','[','return',']','end']

I can also of course do something like this to separate each character:

> key_tokens = [c for c in key_data]
> print(key_tokens)
['s','t','u','f','f','[','u','p',']','[','l','e','f','t',']','[','r','e','t','u','r','n',']','e','n','d']

I am just having trouble putting it all together.

Edit: Now I am seeing a corner case where the opening square bracket is used as text. Unfortunately, it is not escaped or anything.

> key_data = 'stuff[but[up][left][return]end'
> key_tokens = re.findall(r'\[.*?\]|.', key_data)
> print(key_tokens)
['s','t','u','f','f','[but[up]','[left]','[return]','e','n','d']

What I want to see is:

> print(key_tokens)
['s','t','u','f','f','[','b','u','t','[up]','[left]','[return]','e','n','d']

Solution

  • If you don't mind using re.findall instead of re.split, you can first try to match anything inside square brackets with \[.*?\] and, if that fails, fall back to a single character. That is what |. does in the pattern below: it matches any one character (other than a newline, by default). If your data contains only word characters (letters, digits, and underscores), as in the sample, you could swap it for |\w.

    >>> re.findall(r'\[.*?\]|.', key_data)
    
    ['s', 't', 'u', 'f', 'f', '[up]', '[left]', '[return]', 'e', 'n', 'd']
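
The difference between the |. and |\w fallbacks only shows up on non-word characters such as spaces. A minimal sketch of both variants follows; the helper name `tokenize_keys`, its `word_only` flag, and the second sample string are made up for illustration and are not part of the original answer.

```python
import re

def tokenize_keys(key_data, word_only=False):
    """Split key_data into single characters, keeping [...] groups whole.

    Hypothetical helper for illustration only.
    """
    # Try the bracketed group first; otherwise fall back to one character.
    fallback = r'\w' if word_only else r'.'
    return re.findall(r'\[.*?\]|' + fallback, key_data)

print(tokenize_keys('stuff[up][left][return]end'))
# ['s', 't', 'u', 'f', 'f', '[up]', '[left]', '[return]', 'e', 'n', 'd']

print(tokenize_keys('a b[up]'))
# ['a', ' ', 'b', '[up]']

print(tokenize_keys('a b[up]', word_only=True))
# ['a', 'b', '[up]']  (the space is dropped because \w does not match it)
```

If spaces or punctuation are real keystrokes in your data, the |. form is probably the safer choice, since |\w silently discards them.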
    

    For the updated question: in that case, you may want to consider alternatives to regex, since it is not well suited to this kind of nesting, where each [ has to be compared with the brackets that follow it. Here is a non-regex solution:

    key_data = 'stuff[but[up][left][return]end'

    result = []
    idx = 0
    while idx < len(key_data):  # also handles an empty string safely
        c = key_data[idx]
        if c != '[':
            # Ordinary character: emit it as its own token
            result.append(c)
            idx += 1
            continue
        closing = key_data.find(']', idx + 1)  # nearest ] after the current [
        if closing == -1:
            # No ] remains, so the rest of the string is plain characters
            result.extend(key_data[idx:])
            break
        if '[' in key_data[idx + 1:closing]:
            # Another [ appears before the nearest ], so this [ is literal text
            result.append(c)
            idx += 1
        else:
            # Emit everything from [ through the nearest ] as one token
            result.append(key_data[idx:closing + 1])
            idx = closing + 1

    print(result)
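
For the corner case, a regex may still be workable if you can assume that a bracketed token never contains brackets itself. Replacing .*? with a negated character class stops a group from swallowing a stray [. This is a sketch under that assumption rather than a replacement for the loop above; the names `TOKEN_RE` and `tokenize_keys_strict` are hypothetical.

```python
import re

# Assumption: a valid modifier is "[" + characters other than "[" or "]" + "]".
# re.DOTALL lets the single-character fallback match newlines as well.
TOKEN_RE = re.compile(r'\[[^\[\]]*\]|.', re.DOTALL)

def tokenize_keys_strict(key_data):
    """Tokenize keystrokes; a [ with no well-formed group is a plain character."""
    return TOKEN_RE.findall(key_data)

print(tokenize_keys_strict('stuff[but[up][left][return]end'))
# ['s', 't', 'u', 'f', 'f', '[', 'b', 'u', 't', '[up]', '[left]', '[return]', 'e', 'n', 'd']
```

Compared with the loop, this trades explicit index bookkeeping for a single pattern. It relies on the same rule the loop applies: a [ only starts a token when the nearest ] comes before any other [.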