
Error when creating a simple custom dynamic tokenizer in Python


I am trying to create a dynamic tokenizer, but it does not work as intended.

Below is my code:

import re

def tokenize(sent):

  splitter = re.findall("\W",sent)
  splitter = list(set(splitter))

  for i in sent:
    if i in splitter:
      sent.replace(i, "<SPLIT>"+i+"<SPLIT>")

  sent.split('<SPLIT>')
  return sent


sent = "Who's kid are you? my ph. is +1-6466461022.Bye!"

tokens = tokenize(sent)

print(tokens)

This does not work: it prints the original string unchanged.

I expected it to return the below list:

["Who", "'s", "kid", "are", "you", "?", "my", "ph", ".", "is", "+", "1", "-", "6466461022", ".", "Bye", "!"]

Solution

  • This would be pretty trivial if it weren't for the special treatment of the apostrophe ('). I'm assuming you're doing NLP, so you want to take into account which "side" the ' belongs to. For instance, "tryin'" should not be split, and neither should "'tis" (short for "it is").

    import re
    
    
    def tokenize(sent):
        # re.split() keeps text matched by capturing groups and drops the rest.
        split_pattern = r"(\w+')(?:\W+|$)|('\w+)|(?:\s+)|(\W)"
        # Drop the None and empty-string items that re.split() leaves behind.
        return [word for word in re.split(split_pattern, sent) if word]
    
    sentences = (
        "Who's kid are you? my ph. is +1-6466461022.Bye!",
        "Tryin' to show how the single quote can belong to either side",
        "'tis but a regex thing + don't forget EOL testin'",
        "You've got to love regex"
    )
    
    for sentence in sentences:
        print(tokenize(sentence))
    
    

    Python's re module tries the alternatives in a pattern containing | from left to right and stops at the first alternative that matches, even if a later alternative would give a longer match. That is why the apostrophe alternatives come before the catch-all (\W).
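
As a small illustration of that left-to-right behaviour (a sketch only, not part of the tokenizer; the two pattern strings are made up for the demo), swapping two alternatives changes what is matched at the apostrophe:

```python
import re

word = "Who's"

# \W is listed first, so it wins at the apostrophe and only ' is matched.
catch_all_first = re.findall(r"\W|'\w+", word)

# '\w+ is listed first, so the apostrophe and the following letters match together.
specific_first = re.findall(r"'\w+|\W", word)

print(catch_all_first)  # ["'"]
print(specific_first)   # ["'s"]
```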

    Furthermore, re.split() keeps the text matched by capturing groups in its result; without a capturing group, the string is split and the text at each split point is dropped. A group that takes no part in a given match shows up as None, which is why the list comprehension filters with "if word".
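
To see the effect of capturing groups on re.split() in isolation, here is a minimal sketch; the sample string and variable names are only for illustration:

```python
import re

number = "+1-646"

# No capturing group: the separators are dropped.
without_groups = re.split(r"[-+]", number)

# One capturing group: the separators are kept as list items.
with_group = re.split(r"([-+])", number)

# Several groups: every group contributes one item per split point,
# and a group that took no part in the match contributes None.
several_groups = re.split(r"(-)|(\+)", number)

print(without_groups)   # ['', '1', '646']
print(with_group)       # ['', '+', '1', '-', '646']
print(several_groups)   # ['', None, '+', '1', '-', None, '646']
```

The empty strings and None items are the reason the result usually needs a truthiness filter before it is used as a token list.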

    Pattern breakdown:

    1. (\w+')(?:\W+|$) - a word followed by a ' with no word character immediately after it, e.g. "tryin'", "testin'". The non-word characters after the ' are matched but not captured, so they are discarded.
    2. ('\w+) - ' followed by at least one word character. Will match "'t" and "'ve" in "don't" and "they've", respectively.
    3. (?:\s+) - split on any whitespace, but discard the whitespace itself
    4. (\W) - split on all non-word characters (no need to bother finding the subset that's present in the string itself)
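
If the one-line pattern is hard to read, the same four alternatives can be written with re.VERBOSE so that each one carries its own comment. This is only a sketch of an equivalent spelling; the names SPLIT_PATTERN and tokenize_verbose are made up here and are not part of the answer's code:

```python
import re

# Same alternatives as the one-line pattern, in the same order.
# In VERBOSE mode, whitespace and #-comments inside the pattern are ignored.
SPLIT_PATTERN = re.compile(
    r"""
      (\w+')(?:\W+|$)   # 1. word ending in a quote; what follows is not captured
    | ('\w+)            # 2. quote followed by word characters
    | (?:\s+)           # 3. whitespace: split here, keep nothing
    | (\W)              # 4. any other non-word character, kept as a token
    """,
    re.VERBOSE,
)


def tokenize_verbose(sent):
    # Filter out the None and empty-string items produced by re.split().
    return [token for token in SPLIT_PATTERN.split(sent) if token]


tokens = tokenize_verbose("Who's kid are you? my ph. is +1-6466461022.Bye!")
print(tokens)
```

One thing to be aware of: alternative 1 consumes the non-word characters after the quote without capturing them, so punctuation directly after a trailing quote is lost. For example, "tryin'!" yields only "tryin'".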