python nlp nltk tokenize

Regular expression tokenization with numbers?


I expect the following code to tokenize

this is an example 123

into

['this', 'is', 'an', 'example 123'] 

but it doesn't treat the number as part of the word. Any suggestions?

import re
from nltk.tokenize import RegexpTokenizer
pattern=re.compile(r"[\w\s\d]+")
tokenizer_number=RegexpTokenizer(pattern)
tokenizer_number.tokenize("this is an example 123")

Solution

  • A reasonably well-formed regex:

    [\d.,]+|[A-Z][.A-Z]+\b\.*|\w+|\S
    

    This topic has been answered before.

    You can test regexes interactively at https://regex101.com
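
To make the behaviour concrete, here is a minimal sketch that runs three patterns on the example sentence using only the standard `re` module. The first is the pattern from the question: because `\s` sits inside the character class, spaces are matched as well and the whole sentence comes back as one token. The second is the regex suggested above, which splits words and numbers into separate tokens. The third, `joined_pattern`, is a hypothetical alternative that is not part of the original answer; it assumes the goal is to keep a number attached to the word that precedes it.

```python
import re

text = "this is an example 123"

# Pattern from the question: \s is inside the character class, so spaces are
# matched too and the whole sentence comes back as a single token.
original_tokens = re.findall(r"[\w\s\d]+", text)

# Pattern from the answer above: numbers and words become separate tokens.
answer_pattern = r"[\d.,]+|[A-Z][.A-Z]+\b\.*|\w+|\S"
answer_tokens = re.findall(answer_pattern, text)

# Hypothetical alternative (not from the answer): a word, optionally followed
# by whitespace-separated runs of digits, kept together as one token.
joined_pattern = r"\w+(?:\s+\d+)*"
joined_tokens = re.findall(joined_pattern, text)

print(original_tokens)  # ['this is an example 123']
print(answer_tokens)    # ['this', 'is', 'an', 'example', '123']
print(joined_tokens)    # ['this', 'is', 'an', 'example 123']
```

The same pattern strings should also work when passed to `RegexpTokenizer`, which by default returns the matches of the pattern. The alternative uses a non-capturing group `(?:...)` so that the full match, rather than a captured group, is returned as the token.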