python, xml, xml-parsing, strip-tags

Python XML finding the specific location of a tag


I am currently parsing an XML file using lxml.etree in Python. I am running into some issues extracting the text within the element tags.

The following example illustrates my current problem.

<body> <P> Title 1 </P> This is a sample text after the the p tag that contains the title. </body>

<body> sample text sample text sample text sample text <P> not a title </P> This is sample text after a p tag that does not contain a title. </body>

<body> sample text sample text <P> not a title </P> This is sample text after a p tag that does not contain a title. <P> Another P tag not containing a title </P></body> 

<body> <P> Title 2 </P> This is a sample text after the the p tag that contains the title. </body>

My problem is the following:

I am using the first P tag to capture the title of each body tag, if there is one. The title is (in most cases) the first P tag right after the opening body tag (see lines 1 and 4 of the example). I don't have a fixed list of title names, which is why I am using this method to capture titles.

The problem arises when a body has no title but contains a P tag somewhere that does not come right after the opening body tag (see lines 2 and 3 of the example). The program takes that first P tag and its text as the title. In this scenario the P tag is not a title and shouldn't be treated as one. Because it is treated as one, any text before that P tag is discarded and not written to the new text file.

For clarification, the following is what is currently written to the text file.

Title 1 : This is a sample text after the the p tag that contains the title.
not a title : This is sample text after a p tag that does not contain a title.
not a title : This is sample text after a p tag that does not contain a title. Another P tag not containing a title
Title 2 : This is a sample text after the the p tag that contains the title.

Desired output to text file

Title 1 : This is a sample text after the the p tag that contains the title.
sample text sample text sample text sample text not a title This is sample text after a p tag that does not contain a title.
sample text sample text not a title This is sample text after a p tag that does not contain a title. Another P tag not containing a title
Title 2 : This is a sample text after the the p tag that contains the title.

Possible Solution:

1. Is there any way to find the location of the first P tag? If the first P tag exists right after the body tag, I would like to keep it. Any other P tag I would like to strip while keeping its text. I can do the stripping with a function provided by lxml.etree:

strip_tags()

Any insight on this problem or another possible solution is greatly appreciated ... thank you in advance!
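
A minimal sketch of the position check asked about above, not a definitive implementation. In lxml.etree (and in the standard library's xml.etree.ElementTree, which the sketch falls back to), the text between an opening tag and its first child is stored in the parent's `.text`, and the text that follows a child is stored in that child's `.tail`. A P tag therefore sits right after the body tag when `body[0]` is a P and `body.text` is empty or whitespace only. The names `split_title` and `format_line`, the `<root>` wrapper, and the shortened sample data are assumptions made for illustration.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: every call used below also exists in the standard library.
    import xml.etree.ElementTree as etree


def split_title(body):
    """Return (title, text) for one <body> element.

    title is None unless a <P> is the first thing inside <body>.
    """
    first = body[0] if len(body) else None
    # body.text is the text between <body> and its first child element.
    leading = (body.text or "").strip()
    title = None
    if first is not None and first.tag == "P" and not leading:
        title = " ".join((first.text or "").split())
        # remove() also drops the element's tail, so save the tail first.
        tail = first.tail
        body.remove(first)
        body.text = tail
    # itertext() yields the text of the body and of every remaining <P>,
    # in document order, so no explicit tag stripping is needed here.
    text = " ".join("".join(body.itertext()).split())
    return title, text


def format_line(body):
    title, text = split_title(body)
    return f"{title} : {text}" if title is not None else text


sample = (
    "<root>"
    "<body> <P> Title 1 </P> Text after the title. </body>"
    "<body> leading text <P> not a title </P> trailing text. </body>"
    "</root>"
)
root = etree.fromstring(sample)
lines = [format_line(body) for body in root.findall("body")]
print("\n".join(lines))
```

If the P tags themselves must be removed from the tree (for example before serializing it again), `etree.strip_tags(body, "P")` from lxml can be called after the title check. For plain text extraction, `itertext()` already ignores the tags.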


Solution

  • I was able to identify the titles with BeautifulSoup and a regular expression.

    import re
    
    from bs4 import BeautifulSoup
    
    
    markup = """<body> <P> Title 1 </P> This is a sample text after the the p tag that contains the title. </body>
    
    <body> sample text sample text sample text sample text <P> not a title </P> This is sample text after a p tag that does not contain a title. </body>
    
    <body> sample text sample text <P> not a title </P> This is sample text after a p tag that does not contain a title. <P> Another P tag not containing a title </P></body> 
    
    <body> <P> Title 2 </P> This is a sample text after the the p tag that contains the title. </body>"""
    
    
    # html.parser lowercases tag names, so <P> becomes <p> in the parsed tree.
    soup = BeautifulSoup(markup, 'html.parser')
    
    for body in soup.select('body'):
        # A body has a title only when <p> directly follows <body>,
        # separated by nothing but optional whitespace.
        has_title = re.search(r'<body>\s*<p>', str(body)) is not None
        if has_title:
            print(body.p.text)
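
The snippet above only prints the titles. A possible extension, sketched under assumptions rather than taken from the original answer, builds the full output lines from the question without a regular expression. It looks at the first child of each body that is not a whitespace-only string and treats it as the title only if it is a `<p>` tag. The helper name `body_to_line` and the shortened markup are hypothetical.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def body_to_line(body):
    """Format one <body> as 'title : text', or as plain text without a title."""
    # First child that is not a whitespace-only string (None if body is empty).
    first = next(
        (c for c in body.contents if not (isinstance(c, str) and not c.strip())),
        None,
    )
    if isinstance(first, Tag) and first.name == "p":
        title = " ".join(first.get_text().split())
        first.extract()  # detach the title so it is not repeated in the text
        text = " ".join(body.get_text().split())
        return f"{title} : {text}"
    # No leading <p>: keep all text, including the text inside later <p> tags.
    return " ".join(body.get_text().split())


markup = (
    "<body> <P> Title 1 </P> Text after the title. </body>"
    "<body> leading text <P> not a title </P> trailing text. </body>"
)
soup = BeautifulSoup(markup, "html.parser")
lines = [body_to_line(body) for body in soup.find_all("body")]
print("\n".join(lines))
```

Checking the parsed children instead of the serialized string avoids depending on how the markup is printed back, for example if the body tag carried attributes.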