
How to get all text of an HTML document (except script/style/noscript tags) using Kuchiki?


I'm trying to get all the text on an HTML page, except for non-visible text (for example, I don't want text inside script/style/noscript tags).

Here's what I've come up with so far:

let parser = kuchiki::parse_html().one(content);
for child in parser.inclusive_descendants() {
    if let Some(el) = child.as_element() {
        let tag_name = &el.name.local;
        if tag_name == "script" || tag_name == "style" || tag_name == "noscript" {
            child.detach();
        }
    }
}
let text = parser.text_contents();
println!("{}", text);

The idea is that the first pass removes any script, style, or noscript elements, and then I can call text_contents to get the visible text.

However, text_contents still seems to return inline JavaScript.

Am I misunderstanding the Kuchiki/html5ever API?


Solution

  • Detaching nodes while iterating with inclusive_descendants() doesn't seem to work reliably: the traversal appears to stop once the node it is currently on has been detached.

    Given the following:

    Cargo.toml

    [dependencies]
    kuchiki = "0.8.1"
    

    main.rs

    use kuchiki::traits::TendrilSink;
    
    fn main() {
        let content = "\
            <html>\
            <head></head>\
            <body>\
                <div>div </div>\
                <script type='text/javascript'>script </script>\
                <noscript>noscript </noscript>\
                <span>span</span>\
            </body>\
            </html>";
    
        // Parse the document into a tree of nodes.
        let parser = kuchiki::parse_html().one(content);
    
        // Print the tag name of every element, in document order.
        for child in parser.inclusive_descendants() {
            if let Some(el) = child.as_element() {
                println!("{}", el.name.local);
            }
        }
    
        // println!("{}", parser.text_contents());
    }
    

    We do get all nodes:

    html
    head
    body
    div
    script
    noscript
    span
    

    If the loop instead detaches the script/style/noscript elements as in the question, and we then print text_contents(), the iterator seems to lose track after the first detached node, so later elements are never removed:

    div noscript span
    

    It doesn't seem to depend on the type of tag either, as swapping the order of the <noscript> and <script> tags gives us:

    div script span
    

    I found that collecting the matching nodes into a Vec first and detaching them afterwards does work:

    parser
        .inclusive_descendants()
        .filter(|node| {
            node.as_element().map_or(false, |e| {
                matches!(e.name.local.as_ref(), "script" | "style" | "noscript")
            })
        })
        .collect::<Vec<_>>()
        .iter()
        .for_each(|node| node.detach());
    
    println!("{}", parser.text_contents());
    
    div span
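
To illustrate why collecting first helps, here is a deliberately simplified model in plain Rust. It is not Kuchiki's actual implementation: `Siblings`, `Node`, `detach_while_walking`, and `collect_then_detach` are hypothetical names, and the "tree" is just a flat list of siblings linked by index. The assumption it demonstrates is that a walk which follows a node's links after that node has been detached finds them cleared and ends early.

```rust
// Toy model only: a flat sibling list linked by index, not Kuchiki's tree.
struct Node {
    tag: &'static str,
    prev: Option<usize>,
    next: Option<usize>,
}

struct Siblings {
    nodes: Vec<Node>,
    first: Option<usize>,
}

impl Siblings {
    fn new(tags: &[&'static str]) -> Self {
        let n = tags.len();
        let nodes = tags
            .iter()
            .enumerate()
            .map(|(i, &tag)| Node {
                tag,
                prev: if i > 0 { Some(i - 1) } else { None },
                next: if i + 1 < n { Some(i + 1) } else { None },
            })
            .collect();
        Siblings { nodes, first: if n > 0 { Some(0) } else { None } }
    }

    /// Unlink node `i` from its neighbours and clear its own links.
    fn detach(&mut self, i: usize) {
        let (prev, next) = (self.nodes[i].prev.take(), self.nodes[i].next.take());
        match prev {
            Some(p) => {
                self.nodes[p].next = next;
            }
            None => {
                if self.first == Some(i) {
                    self.first = next;
                }
            }
        }
        if let Some(n) = next {
            self.nodes[n].prev = prev;
        }
    }

    /// Tags of the nodes that are still linked into the list, in order.
    fn remaining(&self) -> Vec<&'static str> {
        let mut out = Vec::new();
        let mut cur = self.first;
        while let Some(i) = cur {
            out.push(self.nodes[i].tag);
            cur = self.nodes[i].next;
        }
        out
    }
}

fn is_hidden(tag: &str) -> bool {
    matches!(tag, "script" | "style" | "noscript")
}

/// Detach during the walk: the next step reads links that were just cleared.
fn detach_while_walking(list: &mut Siblings) {
    let mut cur = list.first;
    while let Some(i) = cur {
        if is_hidden(list.nodes[i].tag) {
            list.detach(i);
        }
        cur = list.nodes[i].next; // None right after a detach, so the walk stops
    }
}

/// Collect first, then detach: the walk completes before any link changes.
fn collect_then_detach(list: &mut Siblings) {
    let mut hidden = Vec::new();
    let mut cur = list.first;
    while let Some(i) = cur {
        if is_hidden(list.nodes[i].tag) {
            hidden.push(i);
        }
        cur = list.nodes[i].next;
    }
    for i in hidden {
        list.detach(i);
    }
}

fn main() {
    let tags = ["div", "script", "noscript", "span"];

    let mut a = Siblings::new(&tags);
    detach_while_walking(&mut a);
    println!("{:?}", a.remaining()); // ["div", "noscript", "span"]

    let mut b = Siblings::new(&tags);
    collect_then_detach(&mut b);
    println!("{:?}", b.remaining()); // ["div", "span"]
}
```

In this model the first loop removes only the first hidden node, which mirrors the "div noscript span" output above, while the second loop removes both.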