python, multithreading, selenium, server, multiprocessing

Would combining multiprocessing and multithreading be useful in Python if CPU and RAM are maxed out?


I already understand how multiprocessing and multithreading can speed up a program:

  • Multiprocessing is used for CPU-bound tasks
  • Multithreading is used for network-bound tasks
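
As a minimal sketch of those two rules of thumb, the snippet below uses the standard-library `concurrent.futures` pools. The functions `fetch` and `parse` are hypothetical stand-ins (a short sleep for the network wait, a pure-Python loop for the CPU work), not part of the actual scraper.

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def fetch(keyword):
    """Hypothetical network-bound call; the sleep mimics waiting on I/O."""
    time.sleep(0.05)
    return f"<html>{keyword}</html>"


def parse(page):
    """Hypothetical CPU-bound step; pure-Python work like this holds the GIL."""
    return sum(ord(ch) for ch in page)


def fetch_all(keywords, workers=8):
    # Threads overlap the waiting time, so I/O-bound calls finish sooner.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, keywords))


def parse_all(pages, workers=2):
    # Separate processes sidestep the GIL for CPU-bound work.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(parse, pages))
```

When run as a script, `parse_all` should be called from under an `if __name__ == "__main__":` guard, since some platforms start worker processes by re-importing the main module.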

What if the task I am performing is both CPU-bound and network-bound?

My project is a Selenium web scraper that cycles through a list of keywords to search on Amazon. After searching for each keyword, it extracts the contents of all products on the first page (title, price, reviews, shipping methods, etc.) and outputs those contents to an Excel document.

I have hit some major roadblocks in this project:

  • There are 3,500+ keywords I need to scrape every day, and I can process about one keyword every 12 seconds using a single thread and a single process. This needs to be sped up, but I seem to have maxed out my CPU and RAM when running the program (i5 and 16GB). Since my usage is already maxed out, would adding threads or processes improve efficiency?
  • A major share of the CPU time goes to parsing each product's contents and placing them in the correct column of my Excel document. Amazon does not make its website easy to scrape, so it is hard to find a pattern in the HTML for easy extraction. Instead of pulling multiple small elements from each product (title, price, reviews, etc.), I resorted to one big pull that captures all product contents, THEN built an algorithm that parses all the information and writes it to the correct spot in the Excel document.

The majority of the run time seems to be spent parsing the information through the algorithm and writing it to the Excel document. Keeping in mind that my CPU and RAM usage is maxed out, would multithreading and multiprocessing do anything to increase efficiency?

Note: I can provide a code example, but for simplicity I left it out. I realize the easy answer may be: "upload to a server" but I wanted to use that as a last resort.


Solution

  • WebDriver is not thread-safe. You can still share a reference to one WebDriver instance across several threads if you serialise access to it, but this is not advisable. Instead, you can always instantiate one WebDriver instance per thread.

    The thread-safety issue lies not in your code but in the browser bindings themselves. They all assume only one command is issued at a time, just as when simulating a real user. Instantiating one WebDriver instance per thread, on the other hand, launches multiple browser tabs/windows. Up to this point your ideas are sound.

    Different threads can also run against the same WebDriver, but then the results of the tests would not be what you expect. The reason is that when several threads run different tests on different tabs/windows, some thread-safety code is required; otherwise actions such as click() or send_keys() go to whichever tab/window currently has focus, regardless of which thread issued them. In effect, all the tests run simultaneously on the focused tab/window rather than on the intended one.

    However, a viable solution may be to use remote.webdriver, which is an Abstract Base Class for all WebDriver subtypes. An Abstract Base Class allows custom implementations of WebDriver to be registered so that isinstance type checks succeed.