python xml loops text replace

Python - Having trouble iterating through XML files, searching for text, and replacing it where needed


I have several thousand .XML files. Some of the text needs to be changed, as the files were generated with the wrong label. I need to iterate through all of them in a given directory and make the changes where needed. Here is an example of a file in the directory:

<annotation>
    <folder>resized</folder>
    <filename>P123584521_009.jpg</filename>
    <path>D:\Users\path_to_image\P123584521_009.jpg</path>
    <source>
        <database>Unknown</database>
    </source>
    <size>
        <width>1024</width>
        <height>1024</height>
        <depth>3</depth>
    </size>
    <segmented>0</segmented>
    <object>
        <name>Green plant</name>
        <pose>Unspecified</pose>
        <truncated>0</truncated>
        <difficult>0</difficult>
        <bndbox>
            <xmin>575</xmin>
            <ymin>548</ymin>
            <xmax>866</xmax>
            <ymax>759</ymax>
        </bndbox>
    </object>
    <object>
        <name>Green plant</name>
        <pose>Unspecified</pose>
        <truncated>1</truncated>
        <difficult>0</difficult>
        <bndbox>
            <xmin>827</xmin>
            <ymin>449</ymin>
            <xmax>1024</xmax>
            <ymax>798</ymax>
        </bndbox>
    </object>
    <object>
        <name>Green plant</name>
        <pose>Unspecified</pose>
        <truncated>0</truncated>
        <difficult>0</difficult>
        <bndbox>
            <xmin>198</xmin>
            <ymin>505</ymin>
            <xmax>559</xmax>
            <ymax>747</ymax>
        </bndbox>
    </object>
    <object>
        <name>Green plant</name>
        <pose>Unspecified</pose>
        <truncated>0</truncated>
        <difficult>0</difficult>
        <bndbox>
            <xmin>592</xmin>
            <ymin>730</ymin>
            <xmax>787</xmax>
            <ymax>945</ymax>
        </bndbox>
    </object>
    <object>
        <name>Green plant</name>
        <pose>Unspecified</pose>
        <truncated>0</truncated>
        <difficult>0</difficult>
        <bndbox>
            <xmin>362</xmin>
            <ymin>756</ymin>
            <xmax>597</xmax>
            <ymax>1008</ymax>
        </bndbox>
    </object>
    <object>
        <name>Green plant</name>
        <pose>Unspecified</pose>
        <truncated>0</truncated>
        <difficult>0</difficult>
        <bndbox>
            <xmin>219</xmin>
            <ymin>748</ymin>
            <xmax>376</xmax>
            <ymax>894</ymax>
        </bndbox>
    </object>
    <object>
        <name>Green plant</name>
        <pose>Unspecified</pose>
        <truncated>1</truncated>
        <difficult>0</difficult>
        <bndbox>
            <xmin>1</xmin>
            <ymin>648</ymin>
            <xmax>351</xmax>
            <ymax>1024</ymax>
        </bndbox>
    </object>
</annotation>

There are 7 objects, each named "Green plant". I need to replace every occurrence of that label with just "Plant". Here is the code I wrote to try to do this:

import os
from tqdm import tqdm
import sys

path = 'D:\\Users\\directory_with_all_xml_files'

files = os.listdir(path)

for file in tqdm(files):
    filename, filetype = file.split('.')
    if filetype == 'xml':
        #Open file
        xml_file = open(file)
        new_file_content = ""
        
        #Replace text
        for line in xml_file:
            stripped_line = line.strip()
            new_line = stripped_line.replace("Green plant", "Plant")
            new_file_content += new_line + "\n"
        xml_file.close()
        
        #Overwrites old file content with new file content
        write_file = open(file)
        write_file.write(new_file_content)
        write_file.close()

However, when I run this code, I get the following:

  File "xml_text_replacer.py", line 13, in <module>
    xml_file = open(file)
FileNotFoundError: [Errno 2] No such file or directory: 'Name_of_very_first_xml_file_in_directory.xml'

I wrote an if statement so that only the XML files are opened, as you can see in the code. However, the script does not iterate the way I need it to: it fails on the very first .xml file in the directory. How can this code be corrected to accomplish this task?


Solution

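  • Here is an XSLT stylesheet that does the job.

    It follows the so-called identity transform pattern: the first template copies every node and attribute unchanged, and the second template overrides it for name elements whose text is "Green plant", writing "Plant" instead.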

    XSLT

    <?xml version="1.0"?>
    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
        <xsl:output method="xml" indent="yes" omit-xml-declaration="yes"/>
        <xsl:strip-space elements="*"/>
    
        <xsl:template match="@*|node()">
            <xsl:copy>
                <xsl:apply-templates select="@*|node()"/>
            </xsl:copy>
        </xsl:template>
    
        <xsl:template match="name[.='Green plant']">
            <xsl:copy>Plant</xsl:copy>
        </xsl:template>
    </xsl:stylesheet>
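
For anyone who would rather stay in Python, here is a minimal sketch of the same replacement using only the standard library. It also avoids the two problems in the question's script: os.listdir returns bare file names, so open(file) is resolved against the current working directory rather than the target directory (hence the FileNotFoundError), and the second open(file) is missing the "w" mode. The function name relabel_directory and its parameters are made up for illustration; the sketch assumes every file parses as well-formed XML and that only name elements carry the label.

```python
import os
import xml.etree.ElementTree as ET


def relabel_directory(directory, old_label="Green plant", new_label="Plant"):
    """Rewrite <name> elements equal to old_label in every .xml file.

    Hypothetical helper: returns the number of files that were modified.
    """
    changed_files = 0
    for entry in os.listdir(directory):
        # splitext copes with file names that contain more than one dot
        if os.path.splitext(entry)[1].lower() != ".xml":
            continue
        # os.listdir returns bare names, so rebuild the full path
        full_path = os.path.join(directory, entry)
        tree = ET.parse(full_path)
        changed = False
        # Touch only <name> elements whose whole text is the old label
        for name in tree.getroot().iter("name"):
            if name.text == old_label:
                name.text = new_label
                changed = True
        if changed:
            # Overwrite the file in place; untouched files are not rewritten
            tree.write(full_path, encoding="utf-8")
            changed_files += 1
    return changed_files
```

With the directory from the question, the call would be relabel_directory(r"D:\Users\directory_with_all_xml_files"). Because the files are overwritten in place, it is worth trying this on a copy of the directory first.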