I have a Python script that parses HTML elements from a URL, with the aid of Beautiful Soup.
I now want to parse all the HTML files in a directory, rather than picking each file and running the script on it one by one. After a weekend of working through modifying my script, I have hit a brick wall!
I have played around with os.walk to help me, but I am struggling to integrate it with my current script. I'm thinking there should be a way to simply write a loop and change my input from a file to a directory? But does that mean I can no longer use urllib, because my input is now a list of files rather than a URL?
This is the start of my script. The elements I parse are identical in every file in the directory, so nothing else should need to change.
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'http://www.mywebsite.com/src_files/abc1.html'

# Fetch the page and read its HTML into memory.
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
The expected results should be as if I ran my current script on each HTML file in the directory individually.
Yes, you no longer need urllib, since you want to parse saved HTML files from a directory rather than fetch pages from a remote HTTP server.
To find all the HTML files in a directory, you can use the glob module.
Example:
from bs4 import BeautifulSoup
from glob import glob

# Collect every .html file in the current directory.
html_files = glob('./*.html')

for html_file in html_files:
    # Open each saved file; the file object can be passed to
    # BeautifulSoup just like the response object from urlopen.
    with open(html_file, 'r', encoding='utf-8') as saved_html:
        soup = BeautifulSoup(saved_html, 'html.parser')
        # Run your existing parsing logic on `soup` here.
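
If your HTML files are spread across subdirectories (you mentioned experimenting with os.walk), a recursive glob pattern handles that too. This is a minimal sketch, assuming Python 3.5+ for glob's recursive=True; the parse_file helper is a hypothetical stand-in for the rest of your existing parsing logic:

from bs4 import BeautifulSoup
from glob import glob

def parse_file(path):
    """Hypothetical helper: run your existing parsing logic on one file."""
    with open(path, 'r', encoding='utf-8') as f:
        soup = BeautifulSoup(f, 'html.parser')
        # ... extract the same elements you parsed before ...
        return soup.title.string if soup.title else None

# '**' with recursive=True matches .html files in this directory
# and in all of its subdirectories.
for html_file in glob('./**/*.html', recursive=True):
    print(html_file, parse_file(html_file))

Wrapping the per-file work in a function like this keeps the loop body identical whether the file came from a flat directory listing or a recursive walk.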