python regex unicode-string tamil

Identifying whether a character is a digit or a Unicode character within a word in Python


I want to check whether a word contains both digits and other characters and, if so, separate the digit part from the character part. I am working with Tamil words, e.g. ரூ.100 or ரூ100. I want to split these into ரூ. and 100, and ரூ and 100, respectively. How do I do this in Python? I tried this:

    for word in f.read().strip().split(): 
      for word1, word2, word3 in zip(word,word[1:],word[2:]): 
        if word1 == "ர" and word2 == "ூ " and word3.isdigit(): 
           print word1 
           print word2 
        if word1.decode('utf-8') == unichr(0xbb0) and word2.decode('utf-8') == unichr(0xbc2): 
           print word1
           print word2

Solution

  • You can use the regular expression (.*?)(\d+)(.*), which captures three groups: everything before the digits, the digits themselves, and everything after them:

    >>> import re
    >>> pattern = ur'(.*?)(\d+)(.*)'
    >>> s = u"ரூ.100"
    >>> match = re.match(pattern, s, re.UNICODE)
    >>> print match.group(1)
    ரூ.
    >>> print match.group(2)
    100
    

    Or, you can unpack matched groups into variables, like this:

    >>> s = u"100ஆம்"
    >>> match = re.match(pattern, s, re.UNICODE)
    >>> before, digits, after = match.groups()
    >>> print before
    
    >>> print digits
    100
    >>> print after
    ஆம்
    

    Hope that helps.
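
The session above is Python 2 (print statements, the ur'' prefix). Below is a minimal sketch of the same pattern in Python 3, where str is already Unicode and the re.UNICODE flag is not needed for str patterns. The helper name split_around_digits and the sample word list are assumptions made for illustration, not part of the original answer. It also handles words with no digits, for which re.match returns None.

```python
import re

# Same pattern as in the answer above; in Python 3 a plain raw string
# is already a Unicode pattern, so no u/ur prefix is required.
PATTERN = re.compile(r'(.*?)(\d+)(.*)')


def split_around_digits(word):
    """Return (before, digits, after), or None if the word has no digits.

    Hypothetical helper name, used here only for illustration.
    """
    match = PATTERN.match(word)
    if match is None:
        # No digit anywhere in the word, so there is nothing to split.
        return None
    return match.groups()


# Sample words (illustrative); the last one contains no digits.
for word in ["ரூ.100", "ரூ100", "100ஆம்", "ரூ"]:
    print(repr(word), "->", split_around_digits(word))
```

Note that in Python 3, \d in a str pattern matches any Unicode decimal digit, including Tamil digits such as ௧. Pass re.ASCII to re.compile if only 0-9 should count as digits.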