Tags: selenium, selenium-webdriver, web-scraping, webdriver, diskcache

Setting Disk Cache size in Selenium, while webscraping multiple websites?


From what I have read, setting the disk cache size in Selenium helps web pages load faster when scraping, or doing anything else, on a single website. But what good does it do to set the disk cache size when dealing with multiple websites?

Or is it in fact bad to set the disk cache size when scraping multiple websites, i.e. could it let the websites detect that we are scraping?


Solution

  • A disk cache is cache memory used to speed up storing and accessing data on the host machine's hard disk. It enables faster reads and writes, command issuing, and other I/O between the hard disk, the memory, and the computing components. A disk cache is also referred to as a disk buffer or cache buffer.


    Chromium disk cache

    The disk cache stores resources fetched from the web so that they can be accessed quickly at a later time if needed. The main characteristics are:

    • The cache should not grow unbounded so there must be an algorithm for deciding when to remove old entries.
    • While it is not critical to lose some data from the cache, having to discard the whole cache should be minimized. The current design should be able to gracefully handle application crashes, no matter what is going on at that time, only discarding the resources that were open at that time. However, if the whole computer crashes while we are updating the cache, everything on the cache probably will be discarded.
    • Access to previously stored data should be reasonably efficient, and it should be possible to use synchronous or asynchronous operations.
    • We should be able to avoid conflicts that prevent us from storing two given resources simultaneously. In other words, the design should avoid cache thrashing.
    • It should be possible to remove a given entry from the cache, and keep working with a given entry while at the same time making it inaccessible to other requests (as if it was never stored).
    • The cache should not be using explicit multithread synchronization because it will always be called from the same thread. However, callbacks should avoid reentrancy problems so they must be issued through the thread's message loop.
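
    To make the first characteristic concrete, here is a toy size-bounded cache with least-recently-used eviction. It is only an illustration of the idea of bounding a cache and evicting old entries; it is not Chromium's actual eviction algorithm, and the class name `BoundedCache` is made up for this sketch.

```python
from collections import OrderedDict


class BoundedCache:
    """Toy in-memory cache that never holds more than max_bytes of data."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self._entries = OrderedDict()  # key -> bytes, least recently used first

    def get(self, key):
        data = self._entries.get(key)
        if data is not None:
            # Mark the entry as recently used so it is evicted last.
            self._entries.move_to_end(key)
        return data

    def put(self, key, data):
        if len(data) > self.max_bytes:
            return False  # larger than the whole cache, so do not store it
        if key in self._entries:
            # Replacing an entry: release the space held by the old value.
            self.used_bytes -= len(self._entries.pop(key))
        self._entries[key] = data
        self.used_bytes += len(data)
        # Evict least recently used entries until we are back under the limit.
        while self.used_bytes > self.max_bytes:
            _, evicted = self._entries.popitem(last=False)
            self.used_bytes -= len(evicted)
        return True
```

    With a 10-byte limit, storing two 5-byte entries, reading the first, and then storing a 3-byte entry evicts the second entry, because it is the least recently used one.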

    Conclusion

    To conclude, by default Chrome is configured with a default value for the disk cache size, which users can change to suit their respective use cases.


    Changing Chrome Cache size on Windows 10

    There is only one method that can be used to set and limit Google Chrome’s cache size.

    • Launch Google Chrome.

    • Right-click on the icon for Google Chrome on the taskbar and again right-click on the entry labeled as Google Chrome.
    • Now click on Properties. It will open the Google Chrome Properties window.
    • Navigate to the tab labeled as Shortcut.
    • In the field called Target, append the following after the full path, separated from it by a space:

      --disk-cache-size=<size in bytes>
      
    • As an example, to set the limit to 2147483648 bytes:

      "C:\Program Files (x86)\Google\Chrome\Application\chrome.exe" --disk-cache-size=2147483648
      

    Here 2147483648 is the size of the cache in bytes, which is equal to 2 gigabytes (2 × 1024 × 1024 × 1024 bytes).

    • Click on Apply and then click on OK for the limit to be set.
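
    Since the question is about Selenium, the same command-line switch can also be passed to a Selenium-driven Chrome through `ChromeOptions`. The following is a minimal Python sketch, assuming the `selenium` package and a compatible Chrome installation are available; the helper names `disk_cache_args` and `make_driver` are hypothetical and exist only for this example.

```python
def disk_cache_args(size_bytes):
    """Build the Chrome command-line switch that caps the disk cache size."""
    if size_bytes <= 0:
        raise ValueError("size_bytes must be positive")
    return [f"--disk-cache-size={size_bytes}"]


def make_driver(size_bytes):
    """Start Chrome through Selenium with the disk cache capped (hypothetical helper)."""
    # Imported lazily so that disk_cache_args() works without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for arg in disk_cache_args(size_bytes):
        options.add_argument(arg)
    return webdriver.Chrome(options=options)


TWO_GIB = 2 * 1024 ** 3  # 2147483648 bytes, the example value used above
```

    Calling `make_driver(TWO_GIB)` would launch Chrome with the limit applied. A cache only speeds things up when the same resources are requested again, so the benefit is likely to be smaller when each page comes from a different website.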