I am currently parsing an XML file using lxml.etree in Python. I am running into some issues with extracting the text inside the element tags.
The following sample data illustrates my problem.
<body> <P> Title 1 </P> This is a sample text after the the p tag that contains the title. </body>
<body> sample text sample text sample text sample text <P> not a title </P> This is sample text after a p tag that does not contain a title. </body>
<body> sample text sample text <P> not a title </P> This is sample text after a p tag that does not contain a title. <P> Another P tag not containing a title </P></body>
<body> <P> Title 2 </P> This is a sample text after the the p tag that contains the title. </body>
My problem is the following:
I am using the first P tag to capture the title of each body tag, if there is one. The title is (in most cases) the first P tag right after the body tag (see lines 1 and 4 of the example). I don't have a fixed list of title names, which is why I am using this method to capture titles.
The problem arises when a body has no title but contains a P tag somewhere that does not come right after the body tag (see lines 2 and 3 of the example). The program takes that first P tag and its text as the title. In this scenario the P tag is not a title and shouldn't be treated as one, but because it is, any text before that P tag is discarded and not written to the new text file.
For further clarification, the following is what is currently written to the text file:
Title 1 : This is a sample text after the the p tag that contains the title.
not a title : This is sample text after a p tag that does not contain a title.
not a title : This is sample text after a p tag that does not contain a title. Another P tag not containing a title
Title 2 : This is a sample text after the the p tag that contains the title.
Desired output to the text file:
Title 1 : This is a sample text after the the p tag that contains the title.
sample text sample text sample text sample text not a title This is sample text after a p tag that does not contain a title.
sample text sample text not a title This is sample text after a p tag that does not contain a title. Another P tag not containing a title
Title 2 : This is a sample text after the the p tag that contains the title.
Possible Solution:
1. Is there any way I can find the location of the first P tag? If the first P tag exists right after the body tag, I would like to keep it. For any other P tag, I would like to strip the tag but keep the text. I can do the stripping with a function in lxml.etree:
strip_tags()
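For illustration, here is a minimal sketch of what strip_tags does to a shortened version of one sample line. The shortened string is made up for this example; the point is only that the P element disappears while its text and trailing text are merged into the parent.

```python
from lxml import etree

# A shortened, hypothetical variant of one sample line, parsed on its own.
body = etree.fromstring(
    "<body> sample text <P> not a title </P> text after the p tag. </body>"
)

# strip_tags removes the <P> elements themselves but keeps their text
# and tail by merging them into the surrounding content.
etree.strip_tags(body, "P")

print(len(body))                 # child elements left under <body>
print(" ".join(body.text.split()))
```

On its own this does not tell a title from a non-title, so it would still need a check on where the first P tag sits before stripping.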
Any insight on this problem or another possible solution is greatly appreciated ... thank you in advance!
I was able to identify the titles with BeautifulSoup and a regular expression.
import re

from bs4 import BeautifulSoup

markup = """<body> <P> Title 1 </P> This is a sample text after the the p tag that contains the title. </body>
<body> sample text sample text sample text sample text <P> not a title </P> This is sample text after a p tag that does not contain a title. </body>
<body> sample text sample text <P> not a title </P> This is sample text after a p tag that does not contain a title. <P> Another P tag not containing a title </P></body>
<body> <P> Title 2 </P> This is a sample text after the the p tag that contains the title. </body>"""

# html.parser lowercases tag names, so <P> shows up as <p> in the parsed tree.
soup = BeautifulSoup(markup, 'html.parser')
bodies = soup.select('body')
for body in bodies:
    # A title exists only when <p> directly follows <body> (whitespace allowed).
    has_title = re.search(r'<body>\s*<p>', str(body)) is not None
    if has_title:
        print(body.p.text.strip())
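The same check can probably be done in lxml alone, without a regular expression. The sketch below is one possible way, not a tested drop-in: it treats the first P tag as a title only when body.text (the text before the first child element) is empty or whitespace. The helper names normalize and format_body and the wrapping root element are assumptions made for this example, and whitespace is collapsed so the result matches the desired output in the question.

```python
from lxml import etree

markup = """<body> <P> Title 1 </P> This is a sample text after the the p tag that contains the title. </body>
<body> sample text sample text sample text sample text <P> not a title </P> This is sample text after a p tag that does not contain a title. </body>
<body> sample text sample text <P> not a title </P> This is sample text after a p tag that does not contain a title. <P> Another P tag not containing a title </P></body>
<body> <P> Title 2 </P> This is a sample text after the the p tag that contains the title. </body>"""


def normalize(text):
    """Collapse runs of whitespace into single spaces (hypothetical helper)."""
    return " ".join(text.split())


def format_body(body):
    """Return 'title : text' if <P> directly follows <body>, else plain text."""
    first = body[0] if len(body) else None
    # body.text is whatever comes before the first child element.
    leading = (body.text or "").strip()
    if first is not None and first.tag == "P" and not leading:
        title = normalize("".join(first.itertext()))
        # Everything after the title: its tail, then each later sibling's
        # own text followed by that sibling's tail.
        parts = [first.tail or ""]
        for sibling in first.itersiblings():
            parts.append("".join(sibling.itertext()))
            parts.append(sibling.tail or "")
        return f"{title} : {normalize(''.join(parts))}"
    # No title: keep all text, including the text inside any <P> tags.
    return normalize("".join(body.itertext()))


# The sample has several <body> elements, so wrap them in a made-up <root>
# to get a single well-formed document.
root = etree.fromstring("<root>" + markup + "</root>")
lines = [format_body(body) for body in root.iter("body")]
print("\n".join(lines))
```

Because the XML parser keeps tag names case-sensitive, the comparison is against "P" here, unlike the lowercase p seen with html.parser.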