Tags: python, file, web-scraping, beautifulsoup

Python - save requests or BeautifulSoup object locally


I have some code that is quite long, so it takes a long time to run. I want to save either the requests response object (in this case "name") or the BeautifulSoup object (in this case "soup") locally, so that the next run can skip the slow fetch. Here is the code:

from bs4 import BeautifulSoup
import requests

url = 'SOMEURL'
name = requests.get(url)
soup = BeautifulSoup(name.content, 'html.parser')

Solution

  • Since name.content is just the raw HTML, you can dump it to a file and read it back in later.

    Usually the bottleneck is not the parsing but the network latency of making the request.

    from bs4 import BeautifulSoup
    import requests
    
    url = 'https://google.com'
    name = requests.get(url)
    
    with open("/tmp/A.html", "w") as f:
      f.write(name.content)
    
    
    # read it back in
    with open("/tmp/A.html") as f:
      soup = BeautifulSoup(f)
      # do something with soup
    
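    If you want to persist the response object itself rather than just the HTML, requests.Response objects can be pickled. Here is a minimal sketch, assuming a cache path of /tmp/response.pkl (pickling the soup object itself is rarely worth it, since re-parsing the cached HTML is cheap, as the timing below shows):

    import pickle
    
    import requests
    
    url = 'https://google.com'
    name = requests.get(url)
    
    # Persist the entire Response (status code, headers, body).
    with open('/tmp/response.pkl', 'wb') as f:
        pickle.dump(name, f)
    
    # Later: restore it without touching the network.
    with open('/tmp/response.pkl', 'rb') as f:
        name = pickle.load(f)
    
    print(name.status_code)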

    Here is some anecdotal evidence that the bottleneck is indeed the network.

    from bs4 import BeautifulSoup
    import requests
    import time
    
    url = 'https://google.com'
    
    t1 = time.perf_counter()
    name = requests.get(url)
    t2 = time.perf_counter()
    soup = BeautifulSoup(name.content, 'html.parser')
    t3 = time.perf_counter()
    
    print(t2 - t1, t3 - t2)
    

    Output from running on a ThinkPad X1 Carbon with a fast campus network; the request takes far longer than the parse:

    0.11 0.02
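
    Putting the pieces together, a small fetch-or-cache helper might look like this sketch (the cache path and helper name are illustrative, not part of the original answer):

    import os
    
    import requests
    from bs4 import BeautifulSoup
    
    CACHE_PATH = '/tmp/A.html'  # illustrative cache location
    
    def get_soup(url):
        # Hit the network only when there is no cached copy yet.
        if not os.path.exists(CACHE_PATH):
            response = requests.get(url)
            with open(CACHE_PATH, 'wb') as f:
                f.write(response.content)
        # Always parse from the local file.
        with open(CACHE_PATH, 'rb') as f:
            return BeautifulSoup(f, 'html.parser')
    
    soup = get_soup('https://google.com')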