
Modify an HTML file (find and replace href URLs and save it)


EDIT1:

I found the mistake in my original code that was causing the TypeError. The answer was here: BeautifulSoup - modifying all links in a piece of HTML? The code is now working.

I have an HTML file and I want to replace some of its href URLs with others, then save it again as an HTML file. My goal is that when I open the HTML file and click on a link, it takes me to a local folder rather than to the original internet URL.

I mean, I want to convert this: <a href="http://www.somelink.com"> into this: <a href="C:/myFolder/myFile.html">.

I tried to open the file with bs4 and use the replace function, but I am getting TypeError: 'NoneType' object is not callable.
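For context, one common way to hit this exact error with bs4 is calling `.replace()` on the soup or on a tag instead of on a string. This is only an assumption, since the failing line is not shown in the post. bs4 treats an unknown attribute name as a tag lookup, so `soup.replace` searches for a `<replace>` element and evaluates to `None`; calling that `None` raises the TypeError. A minimal sketch on an in-memory snippet:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="http://www.somelink.com">link</a>', "html.parser")

# Unknown attribute names are treated as tag names, so this looks for a
# <replace> element, finds none, and evaluates to None.
print(soup.replace)

try:
    # Calling that None is what raises the TypeError.
    soup.replace("http://www.somelink.com", "C:/myFolder/myFile.html")
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```

Strings do have a `replace` method, so `link['href'].replace(...)` is fine; the error only appears when `replace` is called on a bs4 object.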

This is my code so far:


from bs4 import BeautifulSoup

# Dict mapping each original link to the local one that replaces it
links_dict = {"http://www.somelink.com": "C:/myFolder/myFile.html"}  # and so on...

# The original links to look for in the HTML file
# (not strictly needed: the `in links_dict` test below checks the keys directly)
original_links = links_dict.keys()

# html_file is the path to the input HTML file; the encoding goes to open()
with open(html_file, encoding="utf8") as f:
    soup = BeautifulSoup(f, "html.parser")

# Loop through every <a> tag that has an href and, if that href is one of
# the original links, replace it with the matching entry in links_dict
for link in soup.find_all('a', href=True):
    if link['href'] in links_dict:
        link['href'] = links_dict[link['href']]

with open("new_file.html", "w", encoding="utf8") as file:
    file.write(str(soup))

Any ideas?


Solution

  • Once you've got some soup to process, look for the 'a' elements, check their 'href' attributes and, if they match the keys in your dict, replace them as required.

    I'd make 'original_link1' etc. regular expressions, so that you can match them easily.

    As it happens, I believe your question has already been answered; please see BeautifulSoup - modifying all links in a piece of HTML?
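The regexp idea from this answer could look roughly like the sketch below. It is not the code from the linked question. The pattern, the second URL (`http://other.example/`) and the names `link_patterns` and `html` are made up for illustration; only the `somelink.com` to `C:/myFolder/myFile.html` mapping comes from the question.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical mapping: (compiled pattern, local replacement) pairs.
# This pattern accepts http or https, with or without "www." and a trailing slash.
link_patterns = [
    (re.compile(r"^https?://(www\.)?somelink\.com/?$"), "C:/myFolder/myFile.html"),
]

# In-memory sample instead of a file, so the sketch is self-contained.
html = (
    '<p><a href="http://www.somelink.com">one</a> '
    '<a href="http://other.example/">two</a></p>'
)
soup = BeautifulSoup(html, "html.parser")

for link in soup.find_all("a", href=True):
    for pattern, replacement in link_patterns:
        if pattern.search(link["href"]):
            link["href"] = replacement
            break  # the first matching pattern wins

result = str(soup)
print(result)
```

Links that match no pattern are left untouched, so only the URLs you list are redirected to local files. For exact URLs the plain dict lookup from the question is enough; the regexps are mainly useful when the same target appears in several spellings.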