python regex parsing web-scraping html-parsing

Need assistance with a regex pattern in Python – parsing complex HTML structures


I'm trying to parse complex HTML structures using Python's re module, and I've run into a roadblock with my regex pattern. Here's what I'm trying to do:

I have HTML text that contains nested elements, and I want to extract the content of the innermost tags. However, I can't seem to get my regex pattern right. Here's the code I'm using:

import re

html_text = """
<div>
    <div>
        <div>
            Innermost Content 1
        </div>
    </div>
    <div>
        Innermost Content 2
    </div>
</div>
"""

pattern = r'<div>(.*?)<\/div>'
result = re.findall(pattern, html_text, re.DOTALL)

print(result)

I expected this code to return the content of the innermost elements, like this:

['Innermost Content 1', 'Innermost Content 2']

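Instead, the first item in the list still contains the nested <div> tags, and both items include the surrounding newlines and indentation. What am I doing wrong with my regex pattern, and how can I fix it to get the desired result? Any help would be greatly appreciated!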


Solution

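  • In your pattern, .*? is allowed to match <, so the first match starts at the outermost <div> and the captured text includes the nested <div> tags. Use [^<]* instead: it cannot cross a tag, so only a <div> with no child elements can match. Then strip each result to remove the surrounding newlines and indentation: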

    import re
    
    html_text = """
    <div>
        <div>
            <div>
                Innermost Content 1
            </div>
        </div>
        <div>
            Innermost Content 2
        </div>
    </div>
    """
    
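    # [^<]* cannot cross a tag, so only a <div> with no child tags can match.
    # No re.DOTALL needed: the pattern has no '.', and [^<] already matches '\n'.
    pattern = r'<div>([^<]*)</div>'
    result = re.findall(pattern, html_text)
    
    # Drop the newlines and indentation around each captured string.
    result = [content.strip() for content in result if content.strip()]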
    
    print(result)
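
To see why the change matters, here is a minimal sketch that runs both patterns on the same input. The helper name innermost_div_text and the compiled constant INNERMOST_DIV are made up for this illustration. The sketch assumes the divs have no attributes and that the innermost divs contain plain text only.

```python
import re

html_text = """
<div>
    <div>
        <div>
            Innermost Content 1
        </div>
    </div>
    <div>
        Innermost Content 2
    </div>
</div>
"""

# Original pattern: with re.DOTALL, '.' also matches '<' and '\n', so the
# first match starts at the outermost <div> and runs to the first </div>.
lazy_matches = re.findall(r'<div>(.*?)</div>', html_text, re.DOTALL)

# Revised pattern: [^<]* cannot cross a tag, so only a <div> whose body
# contains no child tags can match. (Hypothetical names, for illustration.)
INNERMOST_DIV = re.compile(r'<div>([^<]*)</div>')


def innermost_div_text(html):
    """Return the stripped text of every <div> that contains no other tags."""
    return [m.strip() for m in INNERMOST_DIV.findall(html) if m.strip()]


# The first capture of the original pattern still contains nested tags.
print('<div>' in lazy_matches[0])

# The revised pattern yields only the innermost text.
print(innermost_div_text(html_text))
```

This only holds for input shaped like the example. A div with attributes (such as <div class="x">) or an innermost div that contains inline tags (such as <b>) would be skipped. For HTML you don't control, a real HTML parser is usually the safer choice.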