awk, sed, html-parsing

Replace occurrences in HTML files


I have to replace certain kinds of occurrences in thousands of HTML files, and I intend to use a Linux script for this. Here is an example of the replacements I have to make:

From: <a class="wiki_link" href="/WebSphere+Application+Server">

To: <a class="wiki_link" href="/confluence/display/WIKIHAB1/WebSphere%20Application%20Server">

That is, add /confluence/display/WIKIHAB1 as a prefix and replace "+" with "%20".

I'll do the same for other tags, like img, iframe, and so on.

First, which tool should I use for this: sed, awk, or something else?

If anybody has an example, I would really appreciate it.


Solution

  • After some research I found Beautiful Soup. It's a Python library for parsing HTML files, really easy to use and very well documented. I had no experience with Python and was able to write the code without problems. Here is an example of Python code that makes the replacement I mentioned in the question.

    #!/usr/bin/python
    
    import os
    from bs4 import BeautifulSoup
    
    # Replace each plus sign (+) with %20 and add the /confluence... prefix to the
    # href attribute of every anchor (a) tag whose class includes wiki_link
    def fixAnchorTags(soup):
        tags = soup.find_all('a')
    
        for tag in tags:
            newhref = tag.get("href")
    
            if newhref is not None:
                if tag.get("class") is not None and "wiki_link" in tag.get("class"):
                    newhref = newhref.replace("+", "%20")
                    newhref = "/confluence/display/WIKIHAB1" + newhref
                    tag['href'] = newhref
    
    # Create a folder to save the converted files
    def setup():
        if not os.path.exists("converted"):
            os.makedirs("converted")
    
    # Run all fixes on each HTML file in the current folder
    def run():
        for file in os.listdir("."):
            if file.endswith(".html"):
                print("Converting " + file)
                # prettify("UTF-8") returns bytes, so write in binary mode;
                # reading in binary lets Beautiful Soup detect the encoding
                with open(file, "rb") as htmlfile, \
                     open(os.path.join("converted", file), "wb") as converted:
                    soup = BeautifulSoup(htmlfile, "html.parser")
    
                    fixAnchorTags(soup)
    
                    converted.write(soup.prettify("UTF-8"))
    
    setup()
    run()
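
The question also mentions img, iframe, and other tags. Here is a minimal sketch of how the same approach might be extended to them. The names `rewrite_url`, `fix_url_attributes`, and `URL_ATTRIBUTES` are made up for illustration, and the tag-to-attribute mapping is an assumption about which attributes hold the links. Unlike `fixAnchorTags`, it does not check for the wiki_link class. Instead it assumes that only site-relative paths (those starting with a single "/") should be rewritten, which may or may not match your files.

```python
PREFIX = "/confluence/display/WIKIHAB1"  # prefix taken from the question

# Hypothetical mapping of tag name -> attribute that holds the URL.
# Adjust it to the tags that actually occur in your files.
URL_ATTRIBUTES = {"a": "href", "img": "src", "iframe": "src"}


def rewrite_url(url):
    """Add PREFIX to a site-relative URL and encode '+' as '%20'."""
    # Assumption: absolute URLs (http://...), protocol-relative URLs (//host/...)
    # and relative paths without a leading slash are left untouched.
    if not url.startswith("/") or url.startswith("//"):
        return url
    # Skip URLs that already carry the prefix, so a second run changes nothing.
    if url.startswith(PREFIX + "/"):
        return url
    return PREFIX + url.replace("+", "%20")


def fix_url_attributes(soup, url_attributes=URL_ATTRIBUTES):
    """Rewrite the configured URL attribute of every matching tag in place.

    `soup` is expected to be a BeautifulSoup object, as created in run().
    """
    for tag_name, attr in url_attributes.items():
        for tag in soup.find_all(tag_name):
            value = tag.get(attr)
            if value is not None:
                tag[attr] = rewrite_url(value)
```

To try it, `fix_url_attributes(soup)` could be called in `run()` next to (or in place of) `fixAnchorTags(soup)`. Keeping the string handling in `rewrite_url` makes it easy to check the rule on a few sample links before running it over thousands of files.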