python regex regex-group regex-greedy

Regex returns too many matches


For example, I have this text (on a single line, with no newlines, which is important):

<div> ffjdklfjdklfjs 2015 ddddd </div> sfsfsfsfsf    <div> hkh/ <> -%=:;.éggggggggggg 2018  dsqkdlmqs </div> fdfdfd     </div><div> ffjdklfjdklfjs 2023 ddddd </div> sfsfsfsfsf    <div> hkh/ <> -%=:;.éjhjk 2018 / dsqkdlmqs </div> fdfdfd     </div>

I would like a regex that finds every <div>...2018...</div> block, matching only the blocks whose text contains the date 2018 and no others.

The result should be 2 matches:

<div> hkh/ <> -%=:;.éggggggggggg 2018  dsqkdlmqs </div>
<div> hkh/ <> -%=:;.éjhjk 2018 / dsqkdlmqs </div>

I wrote this regex (I'm using Python):

r"<div>(?=.*?2018).*?<\/div>" /g

But it doesn't work. The result is 4 matches:

<div> ffjdklfjdklfjs 2015 ddddd </div>
<div> hkh/ <> -%=:;.éggggggggggg 2018  dsqkdlmqs </div>
<div> ffjdklfjdklfjs 2023 ddddd </div>
<div> hkh/ <> -%=:;.éjhjk 2018 / dsqkdlmqs </div>

I don't want to match <div> ffjdklfjdklfjs 2015 ddddd </div> or <div> ffjdklfjdklfjs 2023 ddddd </div>, but I can't find a solution :(


Solution

  • Try this code:

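    import re
    
    # Note: this sample contains a newline between the two <div> blocks.
    text = """<div> ffjdklfjdklfjs 2023 ddddd </div> sfsfsfsfsf
         <div> hkhjhjk 2018 / dsqkdlmqs </div> fdfdfd     </div>"""
    
    # `.` does not match a newline by default, so the lookahead (?=.*?2018)
    # only scans to the end of the current line. On single-line input it
    # would also accept a <div> that is followed by 2018 in a later block.
    result = re.search(r'(?=<div)(?=.*?2018)[\s\S]*?(?:<\/div>)', text)
    print(result[0])  # <div> hkhjhjk 2018 / dsqkdlmqs </div>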
    

    =================

    Edit:

        import re
        
        text1 = """<div> ffjdklfjdklfjs 2023 ddddd </div> sfsfsfsfsf
             <div> hkhjhjk 2018 / dsqkdlmqs </div> fdfdfd     </div>"""
    
        text2 = """<div> ffjdklfjdklfjs 2023 ddddd </div> sfsfsfsfsf
             <div> hk
    hjhjk 2018 / dsqkdlmqs </div> fdfdfd     </div>"""
        
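        # [^<]*? keeps the lookahead from scanning past the next "<", so 2018
        # must appear before any "<" inside the block (including a stray "<>").
        reg = re.compile(r'<div>(?=[^<]*?2018)[\s\S]*?<\/div>')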
    
        result1 = reg.search(text1)
        print(result1[0])  # <div> hkhjhjk 2018 / dsqkdlmqs </div>
        result2 = reg.search(text2)
        print(result2[0])  # <div> hk\nhjhjk 2018 / dsqkdlmqs </div>
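
For the exact single-line text in the question, which also contains a stray "<>" inside the blocks, a different approach may be needed. The sketch below is not part of the answer above. It swaps in a tempered pattern that refuses to step over a <div> or </div> tag, and uses re.findall to collect every match. The names `inside`, `pattern`, `original` and `matches` are illustrative, and it assumes the blocks are flat (not nested).

```python
import re

# The single-line sample text from the question.
text = (
    "<div> ffjdklfjdklfjs 2015 ddddd </div> sfsfsfsfsf    "
    "<div> hkh/ <> -%=:;.éggggggggggg 2018  dsqkdlmqs </div> fdfdfd     </div>"
    "<div> ffjdklfjdklfjs 2023 ddddd </div> sfsfsfsfsf    "
    "<div> hkh/ <> -%=:;.éjhjk 2018 / dsqkdlmqs </div> fdfdfd     </div>"
)

# The pattern from the question: the lookahead (?=.*?2018) is not bounded by
# </div>, so it also succeeds when 2018 only appears in a later block.
original = re.findall(r"<div>(?=.*?2018).*?</div>", text)
print(len(original))  # 4

# Consume one character at a time, but only where no <div> or </div> tag
# starts, so the match cannot leave the current block while looking for 2018.
inside = r"(?:(?!</?div>).)*?"
pattern = re.compile(rf"<div>{inside}2018{inside}</div>", re.DOTALL)

matches = pattern.findall(text)
print(len(matches))  # 2
for m in matches:
    print(m)
```

This only matches a block when 2018 sits between its own <div> and the next tag boundary. For real, possibly nested HTML, an HTML parser is likely a safer choice than a regex.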