python regex unicode nlp indic

Regex to add a space between Unicode words and numbers in Python


I tried using basic Unicode-aware regexes, but I cannot make them work on strings that contain characters other than the usual A-Z letters and digits.

I am working with examples from several languages whose scripts are not based on the A-Z alphabet.

import re

text = "20किटल"
res = re.sub(r"^[^\W\d_]+$", lambda ele: " " + ele[0] + " ", text)

Output:
20किटल

2nd try:

regexp1 = re.compile(r'^[^\W\d_]+$', re.IGNORECASE | re.UNICODE)
res = regexp1.sub(lambda ele: " " + ele[0] + " ", text)

Output:
20किटल


Expected output:
20 किटल

Solution

  • Use the PyPI regex library

    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-
    
    import regex
    
    text = "20किटल"
    # str patterns are Unicode-aware by default in Python 3, so no flag is needed
    pat = regex.compile(r"(?<=\d)(?=\p{L})")
    res = pat.sub(" ", text)
    print(res)
    

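    Here \p{L} matches any letter in any script. The two lookarounds match the empty position between a digit and a letter, and that position is replaced with a space.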

    Output:

    20 किटल
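
The attempts in the question fail because `^[^\W\d_]+$` is anchored at both ends, so it only matches a string made up entirely of letters, and "20किटल" starts with digits. If installing the third-party regex package is not an option, the same lookaround idea can be sketched with the standard `re` module, using `[^\W\d_]` as a stand-in for `\p{L}`. This is a minimal sketch, not a drop-in equivalent: the helper name `space_digits_and_letters` is made up for illustration, and the pattern also handles the letter-then-digit direction, which the answer above does not.

```python
import re

# [^\W\d_] means "a word character that is neither a digit nor an underscore",
# which is how the stdlib `re` module can approximate "a letter".
LETTER = r"[^\W\d_]"

# Zero-width boundary: a digit followed by a letter, or a letter followed by a digit.
BOUNDARY = re.compile(rf"(?<=\d)(?={LETTER})|(?<={LETTER})(?=\d)")


def space_digits_and_letters(text: str) -> str:
    """Insert a space at every digit/letter boundary (hypothetical helper)."""
    return BOUNDARY.sub(" ", text)


print(space_digits_and_letters("20किटल"))  # 20 किटल
```

One caveat with this approach: `\w` in `re` does not cover combining marks such as Devanagari vowel signs, so a word that ends in such a mark directly before a digit may not be split. With the regex package, a class like `[\p{L}\p{M}]` would be the usual way to include them.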