Search code examples
pythonhtml-parsinglowercase

How to change string in a certain tag to lowercase in multiple files


I have the following requirement which I am trying to meet in Windows 10 using Python script:

  1. Change all the filenames to lowercase in multiple folders recursively. For this, I used the following code:

    import os path = "C://Users//shilpa//Desktop//content"
    for dir,subdir,listfilename in os.walk(path):
        for filename in listfilename:  
            new_filename = filename.lower()
            src = os.path.join(dir, filename) 
            dst = os.path.join(dir, new_filename) 
            os.rename(src,dst)
    
  2. Update the references of these files embedded in certain tag. The tag here is <img href=(filename.png)>. Here, the <img href=> is constant and the filenames filename.png are different.

So, here is the example:

Existing filenames:

  • ABC.dita
  • XYZ.dita
  • IMG.PNG

These are referenced in different files, say IMG.PNG is referenced in XYZ.dita.

After step1, these change as follows:

  • abc.dita
  • xyz.dita
  • img.png

This will break all the references included in different files.

I want to update all the changed filename references so that the links stay intact.

I don't have any experience with Python and a beginner only. To achieve step2, I should be able to use regex and find a pattern, say,

<img href="(this will be a link to the IMG.PNG>". This will be a part of .dita file.

After step1, the reference in the file will break.

How can I make changes to the filename and also retain their references? The ask here is, find and replace the old names by new names in all the files.

Any help is appreciated.


Solution

  • I'll assume you hav e a folder containing all the relevant files. So the problem is divided into two parts:

    1. Running in a loop over the files
    2. For each file lower the references

    Loopin Over The Files

    This can be done using glob or os.walk.

    import os
    
    upper_directory = "[insert your directory]"
    for dirpath, directories, files in os.walk(upper_directory):
        for fname in files:
            path = os.path.join(dirpath, fname)
            lower_file_references(path)
    

    Lower-casing The references

    When you have the path of the file, you need to read the data from it:

    with open(path) as f:
        s = f.read()
    

    Once you have the data of the referencing files, you can use this code to lower the reference string:

    head = "<img href="
    tail = ">"
    img_start = s.find(head, start)
    while img_start != -1:
        img_end = s.find(">", img_start)
        s = s[:img_start] +s[img_start:img_end].lower() + s[img_end:]
        img_start = img_end
    

    Alternately you can use some XML parsing module. For example BeautifulSoup, this would help avoiding problems like href= vs href =

    from bs4 import BeautifulSoup as bs
    
    s = bs(s)
    imgs = s.find_all("img")
    for i in imgs:
        if "href" in i.attrs:
            i.attrs["href"] = i.attrs["href"].lower()
    s = str(s)
    

    In both ways you can then rewrite the files. you can do it like that:

    with open(path, "w") as f:
        f.write(s)
    

    Putting it all together

    import os
    from bs4 import BeautifulSoup as bs
    
    def lower_file_references(file_path):
        with open(path) as f:
            s = f.read()
        s = bs(s)
        imgs = s.find_all("img")
        for i in imgs:
            if "href" in i.attrs:
                i.attrs["href"] = i.attrs["href"].lower()
        s = str(s)
        with open(path, "w") as f:
            f.write(s)
    
    upper_directory = "[insert your directory]"
    for dirpath, directories, files in os.walk(upper_directory):
        for fname in files:
            path = os.path.join(dirpath, fname)
            lower_file_references(path)
    

    I must say that this is a simple method, which will work great assuming your files aren't huge. If you have big files that can't be read to memory all at once, or a lot of files you might want to think of ways avoiding reading the data of all the files.