Tags: rust, servo, html5ever

How do I parse a page with html5ever, modify the DOM, and serialize it?


I would like to parse a web page, insert anchors at certain positions and render the modified DOM out again in order to generate docsets for Dash. Is this possible?

From the examples included in html5ever, I can see how to read an HTML file and do a poor man's HTML output, but I don't understand how I can modify the RcDom object I retrieved.

I would like to see a snippet that inserts an anchor element (<a name="foo"></a>) into an RcDom.

Note: this is a question regarding Rust and html5ever specifically ... I know how to do it in other languages or simpler HTML parsers.


Solution

  • Here is some code that parses a document, appends an anchor fragment (#anchor) to the link's href attribute, and prints the new document:

    extern crate html5ever;
    
    use html5ever::{ParseOpts, parse_document};
    use html5ever::tree_builder::TreeBuilderOpts;
    use html5ever::rcdom::RcDom;
    use html5ever::rcdom::NodeEnum::Element;
    use html5ever::serialize::{SerializeOpts, serialize};
    use html5ever::tendril::TendrilSink;
    
    fn main() {
        let opts = ParseOpts {
            tree_builder: TreeBuilderOpts {
                drop_doctype: true,
                ..Default::default()
            },
            ..Default::default()
        };
        let data = "<!DOCTYPE html><html><body><a href=\"foo\"></a></body></html>".to_string();
        let dom = parse_document(RcDom::default(), opts)
            .from_utf8()
            .read_from(&mut data.as_bytes())
            .unwrap();
    
        let document = dom.document.borrow();
        let html = document.children[0].borrow();
        let body = html.children[1].borrow(); // Implicit head element at children[0].
    
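        // Keep the mutable borrow in its own scope so it is released
        // before serialize() reads the tree.
        {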
            let mut a = body.children[0].borrow_mut();
            if let Element(_, _, ref mut attributes) = a.node {
                attributes[0].value.push_tendril(&From::from("#anchor"));
            }
        }
    
        let mut bytes = vec![];
        serialize(&mut bytes, &dom.document, SerializeOpts::default()).unwrap();
        let result = String::from_utf8(bytes).unwrap();
        println!("{}", result);
    }
    

    This prints the following:

    <html><head></head><body><a href="foo#anchor"></a></body></html>
    

    As you can see, we can navigate through the child nodes by indexing each node's children field after borrowing the node.

    To change an attribute, we borrow the node mutably, match on its Element variant, and modify the entry in its vector of attributes.
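
    The example above edits an existing attribute but does not insert a new element, which is what the question asked for. The sketch below illustrates the idea on a simplified, std-only model of the same Rc<RefCell<...>> tree shape. Node, Handle, element, insert_anchor, render and demo are hypothetical stand-ins written for this illustration, not html5ever's API. It assumes that children is a plain vector of reference-counted handles, as the indexing in the code above suggests; a real DOM node may also track its parent, which this model leaves out.

```rust
use std::cell::RefCell;
use std::rc::Rc;

// Hypothetical stand-in for a DOM node. This is NOT html5ever's real type;
// it only mirrors the "shared handle + vector of children" shape.
struct Node {
    name: String,
    attrs: Vec<(String, String)>,
    children: Vec<Handle>,
}

type Handle = Rc<RefCell<Node>>;

// Build a childless element with the given attributes.
fn element(name: &str, attrs: &[(&str, &str)]) -> Handle {
    Rc::new(RefCell::new(Node {
        name: name.to_string(),
        attrs: attrs
            .iter()
            .map(|(k, v)| (k.to_string(), v.to_string()))
            .collect(),
        children: Vec::new(),
    }))
}

// Insert <a name="..."></a> as the first child of `parent`.
fn insert_anchor(parent: &Handle, anchor_name: &str) {
    let anchor = element("a", &[("name", anchor_name)]);
    parent.borrow_mut().children.insert(0, anchor);
}

// Minimal serializer: no escaping, no void elements.
fn render(node: &Handle) -> String {
    let node = node.borrow();
    let mut out = format!("<{}", node.name);
    for (k, v) in &node.attrs {
        out.push_str(&format!(" {}=\"{}\"", k, v));
    }
    out.push('>');
    for child in &node.children {
        out.push_str(&render(child));
    }
    out.push_str(&format!("</{}>", node.name));
    out
}

// Build <body><a href="foo"></a></body>, edit the href, insert an anchor.
fn demo() -> String {
    let body = element("body", &[]);
    body.borrow_mut()
        .children
        .push(element("a", &[("href", "foo")]));

    {
        // Scoped borrows: both are released before `insert_anchor`
        // borrows `body` mutably.
        let body_ref = body.borrow();
        let mut a = body_ref.children[0].borrow_mut();
        a.attrs[0].1.push_str("#anchor");
    }

    insert_anchor(&body, "foo");
    render(&body)
}

fn main() {
    println!("{}", demo());
}
```

    Running this model prints <body><a name="foo"></a><a href="foo#anchor"></a></body>. With the real RcDom the same two steps apply: create a node handle for the new element, then add it to the children vector of the node it should live under, while keeping each mutable borrow in a short scope.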