I am trying to extract text from an XHTML table, as plain text, but preserving the line breaks that would appear if the document were rendered in an HTML renderer. I don't want to preserve the line breaks in the actual raw XML file.
The raw table cells contain lots of superfluous whitespace that HTML browsers don't render, and also contain <p></p> and <br /> tags (which obviously are rendered).
Here is an example of the type of cell the source document contains:
<td>
<p class="TableText10pts"><span style="FONT-SIZE: 11pt; COLOR: black">INTERPOLATION QUEUE FULL
</span><span style="FONT-SIZE: 11pt"></span></p>
<p class="TableText10pts"><span style="FONT-SIZE: 11pt; COLOR: black">In
Interpolated position motion mode the set-point buffer is full. The last
received set-point is not interpolated.</span><span style="FONT-SIZE: 11pt"></span></p></td>
The extracted text for this cell should look like this:
INTERPOLATION QUEUE FULL
In Interpolated position motion mode the set-point buffer is full. The last received set-point is not interpolated.
Or like this (with an extra new line between the paragraphs):
INTERPOLATION QUEUE FULL

In Interpolated position motion mode the set-point buffer is full. The last received set-point is not interpolated.
When I use BeautifulSoup's .get_text(separator=' ', strip=True) method, whitespace inside a text node that a browser would not render is preserved in the output, like this:
INTERPOLATION QUEUE FULL In \n Interpolated position motion mode the set-point buffer is full. The last \n received set-point is not interpolated.
When I use the more-sophisticated BeautifulSoup-based answer from this question, much of the unwanted whitespace disappears but the non-rendered linebreaks are still present, e.g. between "In" and "Interpolated".
When I use Html2Text in its default settings, the non-rendered whitespace is stripped as I want, but the <p> and <br /> tags present in the underlying HTML are ignored, and it injects additional line breaks that are not present in the HTML paragraphs.
Code snippet of my Html2Text usage:
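import xml.etree.ElementTree as ET  # assumed; the original snippet does not show where ET comes from
import html2text

h2t = html2text.HTML2Text()
h2t.ignore_emphasis = True

def element2html(element):
    # Serialize the element (and its children) back to markup
    return ET.tostring(element, encoding='unicode', method='xml')

def get_text(element):
    html = element2html(element)
    return h2t.handle(html).strip()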
Example output from code above:
INTERPOLATION QUEUE FULL In Interpolated position motion mode the set-point\nbuffer is full. The last received set-point is not interpolated.
I can suppress the line-break insertion by setting body_width=0 on the Html2Text converter:
h2t = html2text.HTML2Text()
h2t.body_width=0
h2t.ignore_emphasis=True
[...]
But it still discards the <p> and <br /> layout information from the original HTML. Sample output:
INTERPOLATION QUEUE FULL In Interpolated position motion mode the set-point buffer is full. The last received set-point is not interpolated.
How can I extract the text with whitespace handled the way a browser would?
UPDATE:
Here is another verbatim example of sample XHTML from the source document. (This time I did not elide the formatting attributes on the <td> tag.)
<td style="BORDER-TOP: medium none; HEIGHT: 13.5pt; BORDER-RIGHT: red 1pt solid; WIDTH: 205.55pt; BACKGROUND: white; BORDER-BOTTOM: red 1pt solid; PADDING-BOTTOM: 0pt; PADDING-TOP: 0pt; PADDING-LEFT: 5.4pt; BORDER-LEFT: red 1pt solid; PADDING-RIGHT: 5.4pt" valign="top" width="274">
<p class="TableText10pts"><span style="FONT-SIZE: 11pt; COLOR: black">Motor
stuck - the motor is powered but is not moving according to the definition
of <b>CL[2]</b> and <b>CL[3].</b></span><span style="FONT-SIZE: 11pt"></span></p></td>
I would like the extracted text to be like this (no line breaks):
Motor stuck - the motor is powered but is not moving according to the definition of CL[2] and CL[3].
It's perfectly fine for the <b> tags to be stripped from the output, but running text = [" ".join(p.getText(strip=True).replace("\n", "").split()) for p in soup] on this input also deletes the whitespace around the <b> tags.
So the actual output looks like:
Motor stuck - the motor is powered but is not moving according to the definition ofCL[2]andCL[3].
I'm not sure whether the text inside the <p> elements really wraps across lines the way the question shows, so this answer is an educated guess, but you could try something as simple as this:
from bs4 import BeautifulSoup
sample_html = """<td>
<p class="TableText10pts"><span style="FONT-SIZE: 11pt; COLOR: black">INTERPOLATION QUEUE FULL
</span><span style="FONT-SIZE: 11pt"></span></p>
<p class="TableText10pts"><span style="FONT-SIZE: 11pt; COLOR: black">In Interpolated position motion mode the set-point buffer is full. The last received set-point is not interpolated.</span><span style="FONT-SIZE: 11pt"></span></p></td>"""
soup = BeautifulSoup(sample_html, 'html.parser').getText(strip=True, separator='\n')
print(soup)
This should print:
INTERPOLATION QUEUE FULL
In Interpolated position motion mode the set-point buffer is full. The last received set-point is not interpolated.
However, if the sample really is wrapped across lines the way the question shows, then, IMHO, you don't need any fancy modules.
For example, this:
from bs4 import BeautifulSoup
sample_html = """<td>
<p class="TableText10pts"><span style="FONT-SIZE: 11pt; COLOR: black">INTERPOLATION QUEUE FULL
</span><span style="FONT-SIZE: 11pt"></span></p>
<p class="TableText10pts"><span style="FONT-SIZE: 11pt; COLOR: black">In
Interpolated position motion mode the set-point buffer is full. The last
received set-point is not interpolated.</span><span style="FONT-SIZE: 11pt"></span></p></td>"""
soup = BeautifulSoup(sample_html, 'html.parser').find_all("p")
# split() already splits on newlines; deleting them first would glue "In" and "Interpolated" together
text = [" ".join(p.getText().split()) for p in soup]
print("\n".join(text))
Gives this:
INTERPOLATION QUEUE FULL
In Interpolated position motion mode the set-point buffer is full. The last received set-point is not interpolated.
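If the cells also contain <br /> tags, or inline tags such as <b> as in the question's update, a small parser that breaks lines only at <br /> and at block-level tags may get closer to what a browser shows. The following is a rough sketch using only the standard library's html.parser instead of BeautifulSoup; the names RenderedTextParser, rendered_text and BLOCK_TAGS are made up for this example, and the set of tags treated as line-breaking is an assumption rather than a complete list.

```python
import re
from html.parser import HTMLParser

# Tags treated as line-breaking here. This set is an assumption for the
# sketch, not a complete list of HTML block-level elements.
BLOCK_TAGS = {"p", "div", "td", "th", "tr", "li", "table"}


class RenderedTextParser(HTMLParser):
    """Collects text, starting a new line at <br> and around block-level tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.lines = [[]]  # each line is a list of raw text fragments

    def _break(self):
        self.lines.append([])

    def handle_starttag(self, tag, attrs):
        # <br /> arrives here too: the default handle_startendtag calls
        # handle_starttag followed by handle_endtag.
        if tag == "br" or tag in BLOCK_TAGS:
            self._break()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._break()

    def handle_data(self, data):
        self.lines[-1].append(data)


def rendered_text(html):
    parser = RenderedTextParser()
    parser.feed(html)
    parser.close()
    # Join fragments without a separator so inline tags such as <b> neither
    # add nor remove spaces, then collapse each run of HTML whitespace
    # (space, tab, LF, CR, FF) to a single space.
    collapsed = (
        re.sub(r"[ \t\n\r\f]+", " ", "".join(parts)).strip(" ")
        for parts in parser.lines
    )
    # Drop lines that ended up empty (whitespace between tags).
    return "\n".join(line for line in collapsed if line)


sample = """<td>
<p class="TableText10pts"><span style="FONT-SIZE: 11pt; COLOR: black">Motor
stuck - the motor is powered but is not moving according to the definition
of <b>CL[2]</b> and <b>CL[3].</b></span><span style="FONT-SIZE: 11pt"></span></p></td>"""
print(rendered_text(sample))
# Motor stuck - the motor is powered but is not moving according to the definition of CL[2] and CL[3].
```

Because empty lines are dropped, consecutive <br /> tags collapse into a single line break; keeping them would require tracking explicit breaks separately from block boundaries.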