I've been trying to get the full text hosted inside a <div> element on the web page https://www.list-org.com/company/11665809.
The element should contain the substring "Арбитраж", and it does, because my code
for div in tree.xpath('.//div[contains(text(), "Арбитраж")]'):
    print(div)
returns the response
<Element div at 0x15480d93ac8>
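In case it matters, the tree is built roughly like this (a minimal sketch; fetching the page with requests is only an example, any way of getting the HTML string would do):

import requests
from lxml import html

# Download the page and parse it into an element tree
# (using requests here is just one way to obtain the HTML string)
page = requests.get('https://www.list-org.com/company/11665809')
tree = html.fromstring(page.text)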
But when I try to get the full text itself with div.text, it returns None,
which seems like a strange result to me.
What should I do?
Any help would be greatly appreciated, as well as advice on a good source for learning the basics of HTML (I'm not a savvy programmer) so I can avoid such easy questions in the future.
This is one of those odd things that happen when XPath is handled by a host language and library. When you use the XPath expression
.//div[contains(text(), "Арбитраж")]
the search is performed according to XPath rules: text() selects the div's text-node children, and the div matches because the first of those text nodes contains "Арбитраж".
When you go on to the next line:
print(div.text)
you are using lxml.html, which doesn't regard the target text as part of the div's .text, because the text is preceded by the <i> tag: in lxml, .text only holds the text that appears before an element's first child, and text that follows a child element ends up in that child's .tail. To get the full text with lxml.html, you have to use:
print(div.text_content())
or with XPath only:
print(tree.xpath('.//div[contains(text(), "Арбитраж")]/text()')[0])
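Here is a tiny self-contained illustration of both options (the markup below is made up, just mimicking an icon tag sitting in front of the target text):

from lxml import html

# Hypothetical snippet: an <i> element followed by the text we actually want
div = html.fromstring('<div><i class="icon"></i> Арбитраж: 5 дел</div>')

print(div.text)            # None -- .text only covers text before the first child
print(div[0].tail)         # ' Арбитраж: 5 дел' -- lxml attaches it to the <i> as .tail
print(div.text_content())  # ' Арбитраж: 5 дел' -- all the text inside the element

# The XPath-only variant: text() returns the div's text-node children,
# and the first of them is the string that follows the </i> tag
doc = html.fromstring(
    '<html><body><div><i class="icon"></i> Арбитраж: 5 дел</div></body></html>')
print(doc.xpath('.//div[contains(text(), "Арбитраж")]/text()')[0])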
It seems lxml.etree and BeautifulSoup use different approaches. See this interesting discussion here.