Farsi regular expression in Python


I have this Python 3 regular expression code that I do not understand. I would appreciate an explanation of what exactly it does, with a few examples. The code is this:

# encoding=utf-8
import re
newline = re.sub(r'\s+(((زا(ی)?)?|ام?|ات|اش|ای?(د)?|ایم?|اند?)[\.\!\?\،]*)', r'\1 ', newline)

Solution

  • Here is your regular expression:

    \s+(((زا(ی)?)?|ام?|ات|اش|ای?(د)?|ایم?|اند?)[\.\!\?\،]*)
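
    Because the replacement string in the question refers to a numbered group, it can help to see how Python numbers the parentheses in this pattern. The sketch below only inspects the pattern; the sample string is made up for illustration and does not come from the question.

```python
import re

# Pattern copied verbatim from the question.
pattern = re.compile(r'\s+(((زا(ی)?)?|ام?|ات|اش|ای?(د)?|ایم?|اند?)[\.\!\?\،]*)')

# Groups are numbered by the position of their opening parenthesis.
print(pattern.groups)  # 5

# Made-up input: an ASCII word, spaces, then text the first alternative accepts.
match = pattern.search('word   زای!')
for number, text in enumerate(match.groups(), start=1):
    # Groups that did not take part in the match are reported as None.
    print(number, repr(text))
```

    Group 1 is the outermost pair of parentheses, so it holds the matched ending plus any trailing punctuation. Groups 2 to 5 only structure the alternatives inside it.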
    

    Your replacement is r'\1 ', which means: replace each match with the contents of group 1 followed by a single space. I don't read Farsi, but here is a simpler example:

    \s+((a|b)[./?]*)
    

    So let's execute some code:

    >>> import re
    >>> newline = '     a?    b?        a.'
    >>> re.sub(r'\s+((a|b)[./?]*)', r'\1 ', newline)
    'a? b? a. '
    

    This consumes the run of whitespace in front of one of the listed character groups (the leading \s+) and replaces the whole match with group 1 followed by a single space (r'\1 '). In other words, the space moves from before the matched text to after it.
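
    To try the original pattern itself, here is a small sketch. The helper name attach_endings and the sample strings are assumptions made for illustration; only the pattern and the replacement come from the question.

```python
import re

# Pattern and replacement copied verbatim from the question.
PATTERN = r'\s+(((زا(ی)?)?|ام?|ات|اش|ای?(د)?|ایم?|اند?)[\.\!\?\،]*)'
REPLACEMENT = r'\1 '


def attach_endings(text):
    """Hypothetical wrapper around the question's re.sub call."""
    return re.sub(PATTERN, REPLACEMENT, text)


# Made-up inputs: an ASCII word, a run of spaces, then a candidate ending.
samples = ['word   زا', 'word   !', 'word   ام', 'one    two']

for sample in samples:
    print(repr(sample), '->', repr(attach_endings(sample)))
```

    One detail worth checking when you run this: the first alternative, (زا(ی)?)?, is itself optional, so it can match the empty string. Python's re module tries alternatives from left to right and keeps the first one that lets the match succeed, so as the pattern is written the later alternatives (ام?, ات, and so on) appear never to be reached. Whitespace in front of them is collapsed to one space rather than moved behind them, and the same happens to any other run of whitespace. If the intent was to attach those endings as well, removing the ? after the first alternative's group would be one way to do it, though that is a guess about the author's intent.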