multithreading, rust, web-crawler, spawn

Rust - best way to share a hashset in a structure between multiple workers


I'm pretty new to Rust and I'm trying to port a Go web crawler I wrote. In Go I created a hashmap that was used (and shared) by multiple workers (goroutines running the same function). That was easily solved with mutexes, but I can't work out how to do the same in Rust.
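
From what I've gathered so far, the equivalent of that Go pattern in Rust is to wrap the set in Arc<Mutex<...>> and hand every thread its own clone of the Arc. Here is a minimal standalone sketch of just that pattern (the URLs are made up and this is not the crawler code itself):

    use std::collections::HashSet;
    use std::sync::{Arc, Mutex};
    use std::thread;

    fn main() {
        // Arc gives shared ownership across threads, Mutex gives exclusive access.
        let visited: Arc<Mutex<HashSet<String>>> = Arc::new(Mutex::new(HashSet::new()));

        let mut handles = vec![];
        for i in 0..4 {
            let visited = Arc::clone(&visited);
            handles.push(thread::spawn(move || {
                // Lock only for the insert; the guard is dropped right away.
                visited.lock().unwrap().insert(format!("https://example.com/{}", i));
            }));
        }

        for h in handles {
            h.join().unwrap();
        }

        println!("visited {} urls", visited.lock().unwrap().len());
    }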

The Crawler structure is:

struct Crawler {
    client: reqwest::Client,
    target: String,
    visited: Arc<Mutex<HashSet<String>>>,
    queue: Arc<Mutex<Queue<String>>>,
    base_url: String,
    fetch_any_domain: bool,
    workers: u8,
}

In the impl of the Crawler I added the run function:

    fn run(&self) {
        {
            match self
                .queue
                .lock()
                .unwrap()
                .add(self.convert_link_to_abs(self.target.as_str()))
            {
                Err(e) => println!("{}", e),
                _ => (),
            }
        }

        while self.queue.lock().unwrap().size() > 0 {
            match self.queue.lock().unwrap().remove() {
                Ok(link) => match self.fetch(link.as_str()) {
                    Ok(content) => match self.get_links(content) {
                        Ok(()) => println!("added new links"),
                        Err(e) => println!("{}", e),
                    },
                    Err(e) => println!("{}", e),
                },
                Err(e) => println!("{}", e),
            }
        }
    }

And I was trying to call it concurrently with something like this:

        let mut threads = vec![];
        let c = Arc::new(Mutex::new(crawler));
        for _i in 0..workers {
            let cc = c.clone();
            threads.push(thread::spawn(move || {
                let guard = cc.lock().unwrap();
                guard.run();
            }));
        }

        for t in threads {
            let _ = t.join();
        }

The code compiles and runs, but it gets stuck almost instantly without processing anything. I'm sure I just need to get used to the Rust approach, but could someone advise on the best way to achieve a multithreaded crawler?

Many thanks


Solution

  • The problem is not with the HashSet but with the Queue. If you replace the Queue from the external crate with a Vec from the standard library and split up some statements, it works fine.

    fn run(&self) {
        {
            self.queue
                .lock()
                .unwrap()
                .push(self.convert_link_to_abs(self.target.as_str()));
        }

        while self.queue.lock().unwrap().len() > 0 {
            // Pop in its own statement so the queue lock is released before
            // fetch() and get_links() run.
            let x = self.queue.lock().unwrap().pop();
            match x {
                Some(link) => match self.fetch(&link) {
                    Ok(content) => match self.get_links(content) {
                        Ok(()) => println!("added new links"),
                        Err(e) => println!("{}", e),
                    },
                    Err(e) => println!("{}", e),
                },
                _ => {}
            }
        }
    }
    

    The biggest change is that I pop from the queue in a separate statement instead of directly inside the match. If the whole .lock().unwrap().pop() expression is the match scrutinee, the temporary MutexGuard it creates is kept alive until the end of the entire match, so the queue stays locked while fetch and get_links run; as soon as they try to lock the queue again to add new links, the thread blocks on a lock it already holds. Binding the popped value to a local variable first drops the guard before the match body executes (there is a small standalone sketch of this at the end of this answer).

    However, I'm not sure why the same change doesn't help with the Queue crate you used. I'm also a Rust beginner, so some of this is still unclear to me.

    The changes I made to your code can be seen here: https://pastebin.com/ZrXrsgzf. I tested this and it runs (at least it gets past the point where it got stuck before).

    I also recently implemented a web crawler in Rust and wrote about it here.
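
    To make the guard-lifetime point above concrete, here is a minimal, hypothetical sketch (not the crawler itself, and the function names are made up): the first function blocks forever because the temporary MutexGuard created in the match scrutinee is held for the entire match, while the second one releases the lock before locking again.

        use std::sync::Mutex;

        fn gets_stuck(queue: &Mutex<Vec<String>>) {
            // The temporary MutexGuard returned by lock() lives until the end
            // of the whole match expression, so the queue is still locked
            // inside the arms.
            match queue.lock().unwrap().pop() {
                Some(_link) => {
                    // Locking the same Mutex again on this thread deadlocks
                    // (or panics, per the std docs).
                    queue.lock().unwrap().push("new link".to_string());
                }
                None => {}
            }
        }

        fn works(queue: &Mutex<Vec<String>>) {
            // Binding the popped value drops the guard at the end of this
            // statement...
            let x = queue.lock().unwrap().pop();
            match x {
                Some(_link) => {
                    // ...so locking the queue again here is fine.
                    queue.lock().unwrap().push("new link".to_string());
                }
                None => {}
            }
        }

        fn main() {
            let queue = Mutex::new(vec!["https://example.com".to_string()]);
            works(&queue);
            // gets_stuck(&queue); // uncomment to watch it hang
        }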