Search code examples
pythonxmlpython-re

Detecting and replacing xml from string in python


I have a file that contains text as well as some xml content dumped into it. It looks something like this :

The authentication details : <id>70016683</id><password>password@123</password>
The next step is to send the request.
The request : <request><id>90016133</id><password>password@3212</password></request>
Additional info includes <Address><line1>House no. 341</line1><line2>A B Street</line2><City>Sample city</City></Address>

I am using a python program to parse this file. I would like to replace the xml part with a place holder : xml_obj. The output should look something like this :

The authentication details : xml_obj
The next step is to send the request.
The request : xml_obj
Additional info includes xml_obj

At the same time I would also like to extract the replaced xml text and store it in a list. The list should contain None if the line doesn't have an xml object.

  • I have tried using regex for this purpose :
       xml_tag = re.search(r"<\w*>",line)
       if xml_tag:
            start_position = xml_tag.start()
            xml_word = xml_tag.group()[:1]+'/'+xml_tag.group()[1:]
            xml_pattern = r'{}'.format(xml_word)
            stop_position = re.search(xml_pattern,line).stop()

But this code retrieves the start and stop positions for only one xml tag and it's content for the first line and the entire format for the last line ( in the input file ). I would like to get all xml content irrespective of the xml structure and also replace it with 'xml_obj'.

Any advice would be helpful. Thanks in advance.

Edit :

I also want to apply the same logic to files that look like this :

The authentication details : ID <id>70016683</id> Password <password>password@123</password> Authentication details complete
The next step is to send the request.
The request : <request><id>90016133</id><password>password@3212</password></request> Request successful
Additional info includes <Address><line1>House no. 341</line1><line2>A B Street</line2><City>Sample city</City></Address>

The above files may have more than one xml object in a line.

They may also have some plain text after the xml part.


Solution

  • The following is a a little convoluted, but assuming that your actual text is correctly represented by the sample in your question, try this:

    txt = """[your sample text above]"""
    lines = txt.splitlines()
    entries = []
    new_txt = ''
    
    for line in lines:
        entry = (line.replace(' <',' xxx<',1).split('xxx'))
        if len(entry)==2:
            entries.append(entry[1])
            entry[1]="xml_obj"
            line=''.join(entry)
        else:
            entries.append('none')
        new_txt+=line+'\n'
    for entry in entries:
        print(entry)
    print('---')
    print(new_txt)
    

    Output:

    <id>70016683</id><password>password@123</password>
    none
    <request><id>90016133</id><password>password@3212</password></request>
    <Address><line1>House no. 341</line1><line2>A B Street</line2><City>Sample city</City></Address>
    ---
    The authentication details : xml_obj
    The next step is to send the request.
    The request : xml_obj
    Additional info includes xml_obj