I'm trying to get all the text on an HTML page, except for non-visible text (for example, I don't want text inside script/style/noscript tags).
Here's what I've come up with so far:
let parser = kuchiki::parse_html().one(content);
for child in parser.inclusive_descendants() {
    if let Some(el) = child.as_element() {
        let tag_name = &el.name.local;
        if tag_name == "script" || tag_name == "style" || tag_name == "noscript" {
            child.detach();
        }
    }
}
let text = parser.text_contents();
println!("{}", text);
The idea is that the first pass removes any script, style, or noscript tags. Then I can call text_contents to get the visible text.
However, it seems like text_contents is still returning inline JavaScript.
Am I misunderstanding the Kuchiki/html5ever API?
The inclusive_descendants() iterator doesn't seem to cope with nodes being detached while it is iterating over them.
Given the following:
Cargo.toml
[dependencies]
kuchiki = "0.8.1"
main.rs
use kuchiki::traits::TendrilSink;

fn main() {
    let content = "\
        <html>\
        <head></head>\
        <body>\
        <div>div </div>\
        <script type='text/javascript'>script </script>\
        <noscript>noscript </noscript>\
        <span>span</span>\
        </body>\
        </html>";
    // Parse the document and print the name of every element node.
    let parser = kuchiki::parse_html().one(content);
    for child in parser.inclusive_descendants() {
        if let Some(el) = child.as_element() {
            println!("{}", el.name.local);
        }
    }
    // println!("{}", parser.text_contents());
}
We do get all nodes:
html
head
body
div
script
noscript
span
When calling text_contents() after iterating over the nodes and detaching them as in the question, the iterator seems to lose track after the first detached node:
div noscript span
It doesn't seem to depend on the type of tag either, as swapping the order of the <noscript> and <script> tags gives us:
div script span
I found that detaching the nodes after collecting them first does seem to work:
parser
    .inclusive_descendants()
    .filter(|node| {
        node.as_element().map_or(false, |e| {
            matches!(e.name.local.as_ref(), "script" | "style" | "noscript")
        })
    })
    .collect::<Vec<_>>()
    .iter()
    .for_each(|node| node.detach());
println!("{}", parser.text_contents());
div span
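To picture why collecting first might help, here is a minimal toy model using only the standard library. It is not kuchiki's actual implementation: the Tree and Node types and the next_in_order, detach_while_walking, collect_then_detach and demo functions are all hypothetical names made up for this sketch. It assumes a traversal that finds its next node by following the current node's sibling and parent links, and a detach that clears those links. Under those assumptions the walk stops at the first detached node, which is at least consistent with the output seen above.

```rust
// Toy model only: a tiny arena tree, NOT kuchiki's real data structures.
#[derive(Default)]
struct Node {
    name: &'static str,
    parent: Option<usize>,
    first_child: Option<usize>,
    next_sibling: Option<usize>,
}

struct Tree {
    nodes: Vec<Node>,
}

impl Tree {
    fn new(root: &'static str) -> Self {
        Tree { nodes: vec![Node { name: root, ..Default::default() }] }
    }

    // Append a new last child under `parent` and return its index.
    fn append(&mut self, parent: usize, name: &'static str) -> usize {
        let id = self.nodes.len();
        self.nodes.push(Node { name, parent: Some(parent), ..Default::default() });
        let first = self.nodes[parent].first_child;
        match first {
            None => self.nodes[parent].first_child = Some(id),
            Some(mut last) => {
                while let Some(next) = self.nodes[last].next_sibling {
                    last = next;
                }
                self.nodes[last].next_sibling = Some(id);
            }
        }
        id
    }

    // Unlink `id` from its parent and siblings, clearing its own links.
    fn detach(&mut self, id: usize) {
        let next = self.nodes[id].next_sibling.take();
        if let Some(parent) = self.nodes[id].parent.take() {
            if self.nodes[parent].first_child == Some(id) {
                self.nodes[parent].first_child = next;
                return;
            }
            let mut prev = self.nodes[parent].first_child;
            while let Some(p) = prev {
                if self.nodes[p].next_sibling == Some(id) {
                    self.nodes[p].next_sibling = next;
                    break;
                }
                prev = self.nodes[p].next_sibling;
            }
        }
    }

    // Next node in document order, found by following the links of `id`.
    fn next_in_order(&self, id: usize) -> Option<usize> {
        if let Some(child) = self.nodes[id].first_child {
            return Some(child);
        }
        let mut cur = id;
        loop {
            if let Some(sibling) = self.nodes[cur].next_sibling {
                return Some(sibling);
            }
            cur = self.nodes[cur].parent?;
        }
    }

    // Names of all nodes still reachable from the root (index 0).
    fn names(&self) -> Vec<&'static str> {
        let mut out = Vec::new();
        let mut cur = Some(0);
        while let Some(id) = cur {
            out.push(self.nodes[id].name);
            cur = self.next_in_order(id);
        }
        out
    }
}

fn is_hidden(name: &str) -> bool {
    matches!(name, "script" | "style" | "noscript")
}

// Same element layout as the HTML in the example above.
fn build() -> Tree {
    let mut tree = Tree::new("html");
    tree.append(0, "head");
    let body = tree.append(0, "body");
    for name in ["div", "script", "noscript", "span"] {
        tree.append(body, name);
    }
    tree
}

// Detach during the walk: the detached node has no links left to follow.
fn detach_while_walking(tree: &mut Tree) {
    let mut cur = Some(0);
    while let Some(id) = cur {
        if is_hidden(tree.nodes[id].name) {
            tree.detach(id);
        }
        cur = tree.next_in_order(id);
    }
}

// Finish the walk first, then detach everything that was collected.
fn collect_then_detach(tree: &mut Tree) {
    let mut hidden = Vec::new();
    let mut cur = Some(0);
    while let Some(id) = cur {
        if is_hidden(tree.nodes[id].name) {
            hidden.push(id);
        }
        cur = tree.next_in_order(id);
    }
    for id in hidden {
        tree.detach(id);
    }
}

fn demo() -> (Vec<&'static str>, Vec<&'static str>) {
    let mut during = build();
    detach_while_walking(&mut during);
    let mut after = build();
    collect_then_detach(&mut after);
    (during.names(), after.names())
}
```

Calling demo() from a main function and printing both vectors shows noscript surviving in the first tree and gone in the second. Whether kuchiki's iterator fails for exactly this reason is an assumption; the practical takeaway is the same either way: finish iterating before mutating the tree.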