web-scraping, rust, html5ever

How do I parse a page with html5ever and find all the links?


I would like to parse HTML held in a string with html5ever and find all the links in it. I'm aware of "How do I parse a page with html5ever, modify the DOM, and serialize it?", but RcDom does not exist anymore.


Solution

  • Create a struct that implements TokenSink, then create a new Tokenizer with that struct as its sink. When you parse with Tokenizer::feed(), the tokenizer passes every token it produces to your sink's process_token().

    This code was adapted from the html5ever examples, so it is Apache/MIT licensed. html5ever is a complicated library built for browsers, and it shows: the API appears to be designed to accommodate encodings other than UTF-8.

    This code only parses from stdin. To use it as is, pipe the output of curl into it:

    curl https://stackoverflow.com/questions/59461279/how-do-i-parse-a-page-with-html5ever-and-find-all-the-links | cargo run

    When I do this, I get output like:

    link to: #
    link to: https://stackoverflow.com
    link to: #
    link to: /teams/customers
    ...
    
    The program itself:
    
    use std::io;
    
    use html5ever::interface::QualName;
    use html5ever::tendril::*;
    use html5ever::tokenizer::{
        BufferQueue, StartTag, TagToken, Token, TokenSink, TokenSinkResult, Tokenizer,
        TokenizerOpts,
    };
    use html5ever::{namespace_url, ns, LocalName};
    
    #[derive(Copy, Clone)]
    struct TokenPrinter {}
    
    impl TokenSink for TokenPrinter {
        type Handle = ();
    
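        // Called by the tokenizer once for every token it produces.
        fn process_token(&mut self, token: Token, _line_number: u64) -> TokenSinkResult<()> {
            // Qualified name of a plain `href` attribute (no prefix, empty namespace).
            let link_name = QualName::new(
                None,
                ns!(),
                LocalName::from("href"),
            );
            // Only start tags of <a> elements matter here; every other token is ignored.
            if let TagToken(tag) = token {
                if tag.kind == StartTag && &*tag.name == "a" {
                    for attr in tag.attrs {
                        if attr.name == link_name {
                            println!("link to: {}", attr.value);
                        }
                    }
                }
            }
            // Tell the tokenizer to keep going.
            TokenSinkResult::Continue
        }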
    }
    
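    fn main() {
        let sink = TokenPrinter {};
    
        // Read all of stdin into a byte tendril, then reinterpret it as UTF-8.
        // Both unwraps panic on failure (I/O error, or input that is not valid UTF-8).
        let mut chunk = ByteTendril::new();
        io::stdin().read_to_tendril(&mut chunk).unwrap();
        let mut input = BufferQueue::new();
        input.push_back(chunk.try_reinterpret::<fmt::UTF8>().unwrap());
    
        // Run the tokenizer over the queued input; every token goes to the sink.
        let mut tok = Tokenizer::new(
            sink,
            TokenizerOpts::default(),
        );
        let _ = tok.feed(&mut input);
        assert!(input.is_empty());
        // Signal end of input so the tokenizer can flush anything still pending.
        tok.end();
    }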
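
The program above prints each link as it is found. If you would rather get the links back as a list, the sink can accumulate them. The sketch below shows that idea using only the standard library. The `Token` enum, the `TokenSink` trait, the `feed` function, and `sample_tokens` are all hypothetical stand-ins invented for illustration; they only mimic the shape of the html5ever types and are not its API.

```rust
// Hypothetical stand-in for html5ever's token type; only the shape matters.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Token {
    StartTag { name: String, attrs: Vec<(String, String)> },
    EndTag { name: String },
    Text(String),
}

// Hypothetical stand-in for the sink trait: it receives tokens one at a time.
trait TokenSink {
    fn process_token(&mut self, token: Token);
}

// A sink that stores link targets instead of printing them.
#[derive(Default)]
struct LinkCollector {
    links: Vec<String>,
}

impl TokenSink for LinkCollector {
    fn process_token(&mut self, token: Token) {
        // Keep the href of every <a> start tag; ignore all other tokens.
        if let Token::StartTag { name, attrs } = token {
            if name == "a" {
                for (key, value) in attrs {
                    if key == "href" {
                        self.links.push(value);
                    }
                }
            }
        }
    }
}

// Stand-in for the tokenizer: hands each token to the sink in order.
fn feed<S: TokenSink>(sink: &mut S, tokens: Vec<Token>) {
    for token in tokens {
        sink.process_token(token);
    }
}

// Hand-written token stream, roughly what a tokenizer might emit for
// `<a href="#">top</a><p href="/ignored"><a class="nav" href="https://stackoverflow.com"></a>`.
fn sample_tokens() -> Vec<Token> {
    let attr = |k: &str, v: &str| (k.to_string(), v.to_string());
    vec![
        Token::StartTag { name: "a".into(), attrs: vec![attr("href", "#")] },
        Token::Text("top".into()),
        Token::EndTag { name: "a".into() },
        Token::StartTag { name: "p".into(), attrs: vec![attr("href", "/ignored")] },
        Token::StartTag {
            name: "a".into(),
            attrs: vec![attr("class", "nav"), attr("href", "https://stackoverflow.com")],
        },
        Token::EndTag { name: "a".into() },
    ]
}

// Feed the tokens through a fresh collector and return what it gathered.
fn collect_links(tokens: Vec<Token>) -> Vec<String> {
    let mut sink = LinkCollector::default();
    feed(&mut sink, tokens);
    sink.links
}

fn main() {
    for link in collect_links(sample_tokens()) {
        println!("link to: {}", link);
    }
}
```

Applied to the real program, this would presumably mean giving TokenPrinter a Vec<String> field (and dropping its Copy derive), then reading that field back from the tokenizer's sink after tok.end(). To parse a string instead of stdin, it should be enough to push a StrTendril built from that string (for example with tendril's StrTendril::from_slice) onto the BufferQueue. Both points are assumptions worth checking against the html5ever version you use.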