I am working with Parsel. Unfortunately, I cannot select an <a> tag that is a child of another <a> tag (I know that <a> inside <a> is not valid HTML). How can I handle this situation with Parsel? I have already solved this problem using Beautiful Soup with html.parser as the backend (Beautiful Soup with lxml does not work either).
from parsel import Selector
html_text = '''
<html>
<head>
<base href='http://example.com/' />
<title>Example website</title>
</head>
<body>
<a href="#">
<a id="test" href='image1.html'>Name: My image 1 <br /></a>
<a id="test" href='image2.html'>Name: My image 2 <br /></a>
<a id="test" href='image3.html'>Name: My image 3 <br /></a>
<a id="test" href='image4.html'>Name: My image 4 <br /></a>
<a id="test" href='image5.html'>Name: My image 5 <br /></a>
</a>
</body>
</html>
'''
selector = Selector(text=html_text)
print(selector.xpath('//a/a'))  # the resulting SelectorList is empty
If I put the <a> tags inside a <div> instead, everything works fine, as in the example below:
from parsel import Selector
html_text = '''
<html>
<head>
<base href='http://example.com/' />
<title>Example website</title>
</head>
<body>
<div>
<a id="test" href='image1.html'>Name: My image 1 <br /></a>
<a id="test" href='image2.html'>Name: My image 2 <br /></a>
<a id="test" href='image3.html'>Name: My image 3 <br /></a>
<a id="test" href='image4.html'>Name: My image 4 <br /></a>
<a id="test" href='image5.html'>Name: My image 5 <br /></a>
</div>
</body>
</html>
'''
selector = Selector(text=html_text)
print(selector.xpath('//div/a'))  # the resulting SelectorList is not empty
The lxml.html parser that Parsel uses by default "fixes" the invalid HTML by moving the inner <a> tags out of the outer one, so //a/a no longer matches anything. Try passing type="xml" when instantiating the Selector:
from parsel import Selector
html_text = """
<html>
<head>
<base href='http://example.com/' />
<title>Example website</title>
</head>
<body>
<a href="#">
<a id="test" href='image1.html'>Name: My image 1 <br /></a>
<a id="test" href='image2.html'>Name: My image 2 <br /></a>
<a id="test" href='image3.html'>Name: My image 3 <br /></a>
<a id="test" href='image4.html'>Name: My image 4 <br /></a>
<a id="test" href='image5.html'>Name: My image 5 <br /></a>
</a>
</body>
</html>
"""
selector = Selector(text=html_text, type="xml")
# Uncomment the next line to see how Parsel parsed the document:
# print(selector.getall()[0])
print(selector.xpath("//a/a"))
Prints:
[
<Selector query='//a/a' data='<a id="test" href="image1.html">Name:...'>,
<Selector query='//a/a' data='<a id="test" href="image2.html">Name:...'>,
<Selector query='//a/a' data='<a id="test" href="image3.html">Name:...'>,
<Selector query='//a/a' data='<a id="test" href="image4.html">Name:...'>,
<Selector query='//a/a' data='<a id="test" href="image5.html">Name:...'>
]