python-3.x regex nlp

Regular expression to recognize digits written as words in Python?


It is easy to recognize numbers written as digits in text, but harder when they are written out as words in natural-language text.

To recognize digits with a regex, one can just use the following regular expression.

digits_recognize = r'[0-9]+'
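
As a quick illustration of what this pattern does and does not catch, here is a minimal sketch. The helper name `find_digit_numbers` and the sample sentence are made up for this example.

```python
import re

digits_recognize = r'[0-9]+'

def find_digit_numbers(text):
    """Return every run of ASCII digits in text as a string."""
    return re.findall(digits_recognize, text)

# "seven" is written as a word, so the digit pattern misses it.
print(find_digit_numbers("I bought 3 apples and 12 pears, but not seven plums."))
# ['3', '12']
```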

How can one develop a pattern to recognize numbers written as words?


Solution

  • The following solution works only on Python 3.6 and later, because it uses f-strings.

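    import re
    
    # "one" to "nine"
    one_to_9 = r'((f(ive|our)|s(even|ix)|[tT](hree|wo)|(ni|o)ne|eight))'
    
    # "ten" to "nineteen"
    ten_to_19 = r'((([sS](even|ix)|[fF](our|if)|[nN]ine)[tT][eE]|[eE](ighte|lev))en|[tT]((hirte)?en|welve))'
    
    # "twenty", "thirty", ..., "ninety"
    two_digit_prefix = r'((s(even|ix)|[tT](hir|wen)|f(if|or)|eigh|nine)ty)'
    
    # Tens with an optional unit ("forty-two", "forty two"), then teens, then units
    one_to_99 = fr'({two_digit_prefix}([- ]{one_to_9})?|{ten_to_19}|{one_to_9})'
    
    # Optional hundreds part: "three hundred", "three hundred and five"
    one_to_999 = fr'({one_to_9}[ ]hundred([ ](and[ ])?{one_to_99})?|{one_to_99})'
    
    # The [tT]-style classes cover only some words, so ignore case for the whole pattern
    compiled_pattern = re.compile(one_to_999, re.IGNORECASE)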
    

    The answer is adapted from here.
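
As a usage sketch, the sub-patterns above can be wrapped in word boundaries and applied with `finditer`. The `\b` anchors, the helper name `find_number_words`, and the sample sentence are additions for illustration, not part of the original answer. `re.IGNORECASE` is used so that capitalized words such as "Twenty" also match.

```python
import re

# Sub-patterns copied from the answer above.
one_to_9 = r'((f(ive|our)|s(even|ix)|[tT](hree|wo)|(ni|o)ne|eight))'
ten_to_19 = r'((([sS](even|ix)|[fF](our|if)|[nN]ine)[tT][eE]|[eE](ighte|lev))en|[tT]((hirte)?en|welve))'
two_digit_prefix = r'((s(even|ix)|[tT](hir|wen)|f(if|or)|eigh|nine)ty)'
one_to_99 = fr'({two_digit_prefix}([- ]{one_to_9})?|{ten_to_19}|{one_to_9})'
one_to_999 = fr'({one_to_9}[ ]hundred([ ](and[ ])?{one_to_99})?|{one_to_99})'

# Assumption: \b keeps "one" in "someone" or "ten" in "tents" from matching.
number_words = re.compile(fr'\b{one_to_999}\b', re.IGNORECASE)

def find_number_words(text):
    """Return every spelled-out number (one to nine hundred ninety-nine) in text."""
    # The pattern contains many capturing groups, so re.findall would return
    # tuples of sub-groups; finditer with group(0) yields the full matches.
    return [m.group(0) for m in number_words.finditer(text)]

print(find_number_words("Someone sold Twenty-one tents for three hundred and five dollars."))
# ['Twenty-one', 'three hundred and five']
```

Without the word boundaries, the same pattern would also report the "one" in "Someone" and the "ten" in "tents", which is usually not wanted when scanning running text.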