I have a Python script that parses HTML elements from a URL, with the aid of Beautiful Soup.
I now want to parse all the HTML files in a directory, rather than picking each file and running the script on it one by one. After a weekend of working through modifying my script, I have hit a brick wall!
I have played around with os.walk to help me, but I am struggling to integrate it with my current script. I'm thinking there should be a way to simply write a loop and change my input from a file to a directory? But does that mean I can no longer use urllib, because my input is now a list of files rather than a URL?
This is the start of my script. The elements I parse are identical in every file in the directory, so nothing else should need to change.
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'http://www.mywebsite.com/src_files/abc1.html'

# Fetch the page and read its HTML into memory.
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
The expected results should be as if I ran my current script on each HTML file in the directory individually.
Yes, you no longer need urllib, since you want to parse saved HTML files from a directory rather than fetch pages from a remote HTTP server.
To find all the HTML files in a directory, you can use the glob module.
Example:
from bs4 import BeautifulSoup
from glob import glob

# Collect every .html file in the current directory.
html_files = glob('./*.html')

for html_file in html_files:
    # Open each saved file; the file object can be passed to
    # BeautifulSoup just like the response object from urlopen.
    with open(html_file, 'r', encoding='utf-8') as saved_html:
        soup = BeautifulSoup(saved_html, 'html.parser')
        # Run your existing parsing logic on `soup` here.
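
If your HTML files are spread across subdirectories (you mentioned experimenting with os.walk), a recursive glob pattern handles that too. This is a minimal sketch, assuming Python 3.5+ for glob's recursive=True; the parse_file helper is a hypothetical stand-in for the rest of your existing parsing logic:

from bs4 import BeautifulSoup
from glob import glob

def parse_file(path):
    """Hypothetical helper: run your existing parsing logic on one file."""
    with open(path, 'r', encoding='utf-8') as f:
        soup = BeautifulSoup(f, 'html.parser')
        # ... extract the same elements you parsed before ...
        return soup.title.string if soup.title else None

# '**' with recursive=True matches .html files in this directory
# and in all of its subdirectories.
for html_file in glob('./**/*.html', recursive=True):
    print(html_file, parse_file(html_file))

Wrapping the per-file work in a function like this keeps the loop body identical whether the file came from a flat directory listing or a recursive walk.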