
Save HTML to a file to work with later using Beautiful Soup


I am doing a lot of work with Beautiful Soup. However, my supervisor does not want me doing the work "in real time" from the web. Instead, he wants me to download all the text from a webpage and then work on it later. He wants to avoid repeated hits on a website.

Here is my code:

import requests
from bs4 import BeautifulSoup

url = 'https://scholar.google.com/citations?user=XpmZBggAAAAJ' 
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')

I am unsure whether I should save "page" as a file and then import that into Beautiful Soup, or whether I should save "soup" as a file to open later. I also do not know how to save this as a file in a way that can be accessed as if it were "live" from the internet. I know almost nothing about Python, so I need the absolute easiest and simplest process for this.


Solution

  • Saving soup itself would be tough, and it is outside my experience (read up on the pickling process if you are interested). Save the page instead, as follows:

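    page = requests.get(url)
    page.raise_for_status()  # stop here rather than saving an error page
    # 'wb' writes the raw bytes exactly as the server sent them
    with open('path/to/saving.html', 'wb') as f:
        f.write(page.content)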
    

    Then later, when you want to do analysis on it:

    from bs4 import BeautifulSoup

    with open('path/to/saving.html', 'rb') as f:
        soup = BeautifulSoup(f.read(), 'lxml')
    

    Something like that, anyway. Handing Beautiful Soup the raw bytes from f.read() lets it work out the character encoding itself.
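
If the goal is to avoid repeated hits on the website, the save and load steps can be folded into one helper that only downloads when no saved copy exists yet. This is a minimal sketch, not the only way to do it: the names `get_cached_html` and `fetch_with_requests`, the `fetch` parameter, and the 30-second timeout are all assumptions made for illustration.

```python
from pathlib import Path


def fetch_with_requests(url):
    """Download the page once and return its raw bytes (hypothetical helper)."""
    import requests  # imported here so runs that only read the saved copy do not need it

    response = requests.get(url, timeout=30)  # timeout value is an arbitrary assumption
    response.raise_for_status()  # do not save an error page
    return response.content


def get_cached_html(url, cache_path, fetch=fetch_with_requests):
    """Return the page bytes, hitting the website only if no saved copy exists."""
    cache_file = Path(cache_path)
    if cache_file.exists():
        # A saved copy is already on disk: read it and skip the network entirely.
        return cache_file.read_bytes()
    html = fetch(url)
    # Create the target folder if needed, then save the bytes for next time.
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_bytes(html)
    return html
```

The bytes it returns can go straight into `BeautifulSoup(html, 'lxml')`, the same way as the `f.read()` result above. To force a fresh download, delete the saved file.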