Tags: python, beautifulsoup, scrapy, lxml, parsel

Parsel is not able to access nested elements


I am working with Parsel and cannot select an <a> tag that is a child of another <a> tag (I know that nesting <a> inside <a> is not valid HTML). How can I handle this situation with Parsel? I have already solved the problem using Beautiful Soup with html.parser as the backend (Beautiful Soup with lxml does not work either).

from parsel import Selector

html_text = '''
<html>
    <head>
    <base href='http://example.com/' />
    <title>Example website</title>
    </head>
    <body>
    <a href="#">
        <a id="test" href='image1.html'>Name: My image 1 <br /></a>
        <a id="test" href='image2.html'>Name: My image 2 <br /></a>
        <a id="test" href='image3.html'>Name: My image 3 <br /></a>
        <a id="test" href='image4.html'>Name: My image 4 <br /></a>
        <a id="test" href='image5.html'>Name: My image 5 <br /></a>
    </a>
    </body>
    </html>
'''

selector = Selector(text=html_text)
print(selector.xpath('//a/a'))  # the resulting parsel.selector.SelectorList is empty

If I put the <a> tags inside a <div> instead, everything works fine, as in the example below:

from parsel import Selector

html_text = '''
<html>
    <head>
    <base href='http://example.com/' />
    <title>Example website</title>
    </head>
    <body>
    <div>
        <a id="test" href='image1.html'>Name: My image 1 <br /></a>
        <a id="test" href='image2.html'>Name: My image 2 <br /></a>
        <a id="test" href='image3.html'>Name: My image 3 <br /></a>
        <a id="test" href='image4.html'>Name: My image 4 <br /></a>
        <a id="test" href='image5.html'>Name: My image 5 <br /></a>
    </div>
    </body>
    </html>
'''

selector = Selector(text=html_text)
print(selector.xpath('//div/a'))  # the resulting parsel.selector.SelectorList is not empty

Solution

  • The lxml HTML parser that Parsel uses by default "fixes" the invalid markup and moves the inner <a> elements out of the outer one, so //a/a has nothing to match. Try specifying type="xml" when instantiating the Selector:

    from parsel import Selector
    
    html_text = """
    <html>
        <head>
        <base href='http://example.com/' />
        <title>Example website</title>
        </head>
        <body>
        <a href="#">
            <a id="test" href='image1.html'>Name: My image 1 <br /></a>
            <a id="test" href='image2.html'>Name: My image 2 <br /></a>
            <a id="test" href='image3.html'>Name: My image 3 <br /></a>
            <a id="test" href='image4.html'>Name: My image 4 <br /></a>
            <a id="test" href='image5.html'>Name: My image 5 <br /></a>
        </a>
        </body>
        </html>
    """
    
    selector = Selector(text=html_text, type="xml")
    # To see how Parsel parsed the document, uncomment the next line:
    # print(selector.getall()[0])
    print(selector.xpath("//a/a"))
    

    Prints:

    [
     <Selector query='//a/a' data='<a id="test" href="image1.html">Name:...'>,
     <Selector query='//a/a' data='<a id="test" href="image2.html">Name:...'>,
     <Selector query='//a/a' data='<a id="test" href="image3.html">Name:...'>,
     <Selector query='//a/a' data='<a id="test" href="image4.html">Name:...'>,
     <Selector query='//a/a' data='<a id="test" href="image5.html">Name:...'>
    ]