
Why is jsoup so much faster than selenium or curl?


I have written numerous webscrapers over the past decade. Initially in C++, then C#, but most recently, and extensively, in Java and Python. These days, it is merely a coin toss as to whether I write a webscraper in Java or Python. However, I have noticed over the past 3 or 4 years that, somehow, jsoup is significantly faster than pycurl or pyrequest, and certainly faster than Selenium. What is jsoup's secret? Why does it blow every other method out of the water with respect to speed?


Solution

  • I don't know about pycurl and pyrequest, but I can tell you about JSoup and Selenium. The big difference is that Selenium WebDriver drives a real browser with a live DOM: before each Selenium action is performed, it needs to check whether the element in question is still in the same state. This interaction with a real browser is naturally much more involved than what JSoup does. JSoup is a simple HTML parser: it parses the HTML (or XML) document once and creates an in-memory representation. Only JSoup commands will alter that DOM, so JSoup can be super efficient in dealing with this content.

    The price that you pay with this approach is that JSoup naturally does not interpret or run JavaScript. So for websites that rely on asynchronous data loading, you will need to understand how the data is fetched and load that content yourself. With Selenium you can let the browser do all the work and "harvest" the rendered resulting HTML.