Tags: python, regex, match, python-re

How can I match a pattern, and then everything up to that pattern again? That is, match all the words and acronyms in the example below.


Context

I have the following paragraph:

text = """
בביהכנ"ס - בבית הכנסת דו"ח - דין וחשבון הת"ד -  התיקוני דיקנא
בגו"ר  - בגשמיות ורוחניות ה"א - ה' אלוקיכם התמי' - התמיהה
בהנ"ל - בהנזכר לעיל ה"א - ה' אלקיך ואח"כ - ואחר כך
בהשי״ת - בהשם יתברך ה"ה - הרי הוא / הוא הדין ואת"ה - ואיגוד תלמידי 
"""

This paragraph consists of Hebrew words and their acronyms.

A word contains a quotation mark (").

So for example, some words would be:

[
    'בביהכנ"ס',
    'דו"ח',
    'הת"ד'
]

Now, I'm able to match all the words with this regex:

(\b[\u05D0-\u05EA]*\"\b[\u05D0-\u05EA]*\b)

[Image: screenshot of the matched words, highlighted in green]

Question

But how can I also match all the corresponding acronyms as a separate group? (the acronyms are what's not matched, so not the green in the picture).

Example acronyms are:

[
    'בבית הכנסת',
    'דין וחשבון',
    'התיקוני דיקנא'
]

Expected output

The expected output is a dictionary with the words as keys and the acronyms as values:

{
    'בביהכנ"ס': 'בבית הכנסת',
    'דו"ח': 'דין וחשבון',
    'הת"ד': 'התיקוני דיקנא'
}

My attempt

What I tried was to match all the words (as in the picture above):

(\b[\u05D0-\u05EA]*\"\b[\u05D0-\u05EA]*\b)

and then match everything until the pattern appears again with .*\1, so the entire regex would be:

(\b[\u05D0-\u05EA]*\"\b[\u05D0-\u05EA]*\b).*\1

But as you can see, that doesn't work:

[Image: screenshot of the result]
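
A likely reason for the failure, offered as a hedged aside: in Python's re module a backreference such as \1 matches the exact text that group 1 captured, not a fresh match of the group's pattern. The sketch below uses a simplified stand-in for the word pattern (\w+"\w+, an assumption made for brevity) and two made-up test strings.

```python
import re

# Simplified stand-in for the word pattern: letters, a quote, letters.
# \1 then demands the *same captured text* again later in the string.
repeat = re.compile(r'(\w+"\w+).*\1')

same_word_twice = 'דו"ח - דין וחשבון דו"ח'
two_different_words = 'דו"ח - דין וחשבון הת"ד'

print(repeat.search(same_word_twice) is not None)      # True
print(repeat.search(two_different_words) is not None)  # False
```

Since every word in the paragraph is different, asking for the same text twice cannot pick up the next word.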

  • How can I match the words and acronyms to compose a dictionary with the words/acronyms?

Note

When you print the output, it might be displayed in left-to-right order, although Hebrew should read right to left. If you want to print right to left, see this answer:

right-to-left languages in Python


Solution

  • You can do something like this:

    import re
    
    pattern = r'(\b[\u05D0-\u05EA]*\"\b[\u05D0-\u05EA]*\b)\s*-\s*([^"]+)(\s|$)'
    
    text = """בביהכנ"ס - בבית הכנסת דו"ח - דין וחשבון הת"ד -  התיקוני דיקנא"""
    
    for word, acronym, _ in re.findall(pattern, text):
        print(word + ' == ' + acronym)
    

    which outputs

    בביהכנ"ס == בבית הכנסת
    דו"ח == דין וחשבון
    הת"ד == התיקוני דיקנא
    

    Let's take a closer look at how I built the regex pattern. Here's the pattern from your question that matches words:

    (\b[\u05D0-\u05EA]*\"\b[\u05D0-\u05EA]*\b)

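    This part matches the delimiter between a word and its acronym: \s*-\s* (optional spaces, a dash, optional spaces)

    This part matches one or more characters other than a double quote: ([^"]+)

    On its own, ([^"]+) is greedy and would run up to the quotation mark inside the next word, swallowing that word's first letters. To prevent this, require whitespace or the end of the string after the acronym: (\s|$). This forces the acronym to stop at the last whitespace before the next word.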

    Concatenate all the parts above and you'll get my pattern: (\b[\u05D0-\u05EA]*\"\b[\u05D0-\u05EA]*\b)\s*-\s*([^"]+)(\s|$)
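
The effect of the final part can be checked with a small sketch. The names WORD, DELIM, ACRONYM and END are illustrative only (they are not part of the original answer); the pieces themselves are the ones described above.

```python
import re

WORD = r'(\b[\u05D0-\u05EA]*\"\b[\u05D0-\u05EA]*\b)'  # word pattern from the question
DELIM = r'\s*-\s*'                                     # spaces, dash, spaces
ACRONYM = r'([^"]+)'                                   # anything but a double quote
END = r'(\s|$)'                                        # whitespace or end of string

text = 'בביהכנ"ס - בבית הכנסת דו"ח - דין וחשבון הת"ד -  התיקוני דיקנא'

strict = re.findall(WORD + DELIM + ACRONYM + END, text)  # full pattern
loose = re.findall(WORD + DELIM + ACRONYM, text)         # trailing part left out

print(strict[0][1])  # first acronym, full pattern
print(loose[0][1])   # first acronym without the trailing part
```

Without the trailing part, the first acronym runs on into the letters of the next word that precede its quotation mark.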

    re.findall() returns a list of tuples, one tuple per match. Each tuple contains the strings matched by the groups (the parts in parentheses), in the order the groups appear in the pattern. So to build the dict we need tuple element 0 (the word) and element 1 (the acronym). Element 2 (the trailing whitespace) is not needed.
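
Since the question asks for a dictionary, here is a minimal sketch of that last step. It uses the same pattern and the same one-line sample text as above; the variable name acronyms is just an illustrative choice.

```python
import re

pattern = r'(\b[\u05D0-\u05EA]*\"\b[\u05D0-\u05EA]*\b)\s*-\s*([^"]+)(\s|$)'
text = 'בביהכנ"ס - בבית הכנסת דו"ח - דין וחשבון הת"ד -  התיקוני דיקנא'

# Each findall() tuple is (word, acronym, trailing whitespace);
# keep the first two and discard the third.
acronyms = {word: acronym for word, acronym, _ in re.findall(pattern, text)}

for word, acronym in acronyms.items():
    print(word, '->', acronym)
```

Note that the pattern only recognizes words written with the ASCII quotation mark ("). An entry written with a different mark would need the pattern extended, which this sketch does not attempt.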