python html beautifulsoup text-extraction

Extract text from HTML, handling whitespace and <p> and <br> tags like a browser

I am trying to extract text from an XHTML table, as plain text, but preserving the line breaks that would appear if the document were rendered in an HTML renderer. I don't want to preserve the line breaks in the actual raw XML file.

The raw table cells contain lots of superfluous whitespace that HTML browsers don't render, and also contain <p> and <br> tags (which obviously are rendered).

Here is an example of the type of cell the source document contains:

<td>
  <p class="TableText10pts"><span style="FONT-SIZE: 11pt; COLOR: black">INTERPOLATION QUEUE FULL 
  </span><span style="FONT-SIZE: 11pt"></span></p>
  <p class="TableText10pts"><span style="FONT-SIZE: 11pt; COLOR: black">In 
  Interpolated position motion mode the set-point buffer is full. The last 
  received set-point is not interpolated.</span><span style="FONT-SIZE: 11pt"></span></p></td>

The extracted text for this cell should look like this:

INTERPOLATION QUEUE FULL
In Interpolated position motion mode the set-point buffer is full. The last received set-point is not interpolated.

Or like this (with an extra new line between the paragraphs):

INTERPOLATION QUEUE FULL

In Interpolated position motion mode the set-point buffer is full. The last received set-point is not interpolated.

When I use BeautifulSoup's .get_text(separator=' ',strip=True) method, whitespace in the XML within a text element that would not be rendered in a browser is preserved in the output, like this:

INTERPOLATION QUEUE FULL In \n      Interpolated position motion mode the set-point buffer is full. The last \n      received set-point is not interpolated.

When I use the more-sophisticated BeautifulSoup-based answer from this question, much of the unwanted whitespace disappears but the non-rendered linebreaks are still present, e.g. between "In" and "Interpolated".

When I use Html2Text in its default settings, the non-rendered whitespace is stripped like I want, but the <p> and <br> tags present in the underlying HTML are ignored, and it injects additional line breaks that are not present in the HTML paragraphs.

Code snippet of my Html2Text usage:

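import html2text
import xml.etree.ElementTree as ET  # assumed; the original snippet does not show which module ET is

h2t = html2text.HTML2Text()
h2t.ignore_emphasis=True

def element2html(element):
    # Serialize the element (and its children) back to markup
    return ET.tostring(element, encoding='unicode', method='xml')

def get_text(element):
    html = element2html(element)
    return h2t.handle(html).strip()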

Example output from code above:

INTERPOLATION QUEUE FULL  In Interpolated position motion mode the set-point\nbuffer is full. The last received set-point is not interpolated.

I can suppress the linebreak insertion by configuring the Html2Text converter with body_width=0:

h2t = html2text.HTML2Text()
h2t.body_width=0
h2t.ignore_emphasis=True
[...]

But it is still discarding the <p> and <br> layout information from the original HTML. Sample output:

INTERPOLATION QUEUE FULL  In Interpolated position motion mode the set-point buffer is full. The last received set-point is not interpolated.

How can I extract the text with whitespace handled the way a browser would?

UPDATE: Here is another verbatim example of sample XHTML from the source document. (This time I did not elide the formatting attributes on the <td> tag).

<td style="BORDER-TOP: medium none; HEIGHT: 13.5pt; BORDER-RIGHT: red 1pt solid; WIDTH: 205.55pt; BACKGROUND: white; BORDER-BOTTOM: red 1pt solid; PADDING-BOTTOM: 0pt; PADDING-TOP: 0pt; PADDING-LEFT: 5.4pt; BORDER-LEFT: red 1pt solid; PADDING-RIGHT: 5.4pt" valign="top" width="274">
  <p class="TableText10pts"><span style="FONT-SIZE: 11pt; COLOR: black">Motor 
  stuck - the motor is powered but is not moving according to the definition 
  of <b>CL[2]</b> and <b>CL[3].</b></span><span style="FONT-SIZE: 11pt"></span></p></td>

I would like the extracted text to be like this (no line breaks):

Motor stuck - the motor is powered but is not moving according to the definition of CL[2] and CL[3].

It's perfectly fine for the <b> tags to be stripped from the output, but running text = [" ".join(p.getText(strip=True).replace("\n", "").split()) for p in soup] on this input also deletes the whitespace around the <b> tags.

So the actual output looks like:

Motor stuck - the motor is powered but is not moving according to the definition ofCL[2]andCL[3].

Solution

I'm not sure whether the lines in the real HTML are wrapped the way the sample in the question shows, so this answer is an educated guess, but you could try something as simple as this:

from bs4 import BeautifulSoup

sample_html = """<td>
  <p class="TableText10pts"><span style="FONT-SIZE: 11pt; COLOR: black">INTERPOLATION QUEUE FULL 
  </span><span style="FONT-SIZE: 11pt"></span></p>
  <p class="TableText10pts"><span style="FONT-SIZE: 11pt; COLOR: black">In Interpolated position motion mode the set-point buffer is full. The last received set-point is not interpolated.</span><span style="FONT-SIZE: 11pt"></span></p></td>"""

text = BeautifulSoup(sample_html, 'html.parser').getText(strip=True, separator='\n')
print(text)

This should print:

INTERPOLATION QUEUE FULL
In Interpolated position motion mode the set-point buffer is full. The last received set-point is not interpolated.

However, if the sample really is wrapped the way the question shows, then, IMHO, you don't need any fancy modules.

For example, this:

from bs4 import BeautifulSoup

sample_html = """<td>
  <p class="TableText10pts"><span style="FONT-SIZE: 11pt; COLOR: black">INTERPOLATION QUEUE FULL 
  </span><span style="FONT-SIZE: 11pt"></span></p>
  <p class="TableText10pts"><span style="FONT-SIZE: 11pt; COLOR: black">In 
  Interpolated position motion mode the set-point buffer is full. The last 
  received set-point is not interpolated.</span><span style="FONT-SIZE: 11pt"></span></p></td>"""

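paragraphs = BeautifulSoup(sample_html, 'html.parser').find_all("p")
# split() already collapses newlines and runs of spaces; replacing "\n" with "" could glue words together
text = [" ".join(p.getText().split()) for p in paragraphs]
print("\n".join(text))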

Gives this:

INTERPOLATION QUEUE FULL
In Interpolated position motion mode the set-point buffer is full. The last received set-point is not interpolated.
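The same idea seems to carry over to the second sample from the update, and can be stretched to treat <br> as a line break. What follows is only a rough sketch, not a full browser-style renderer: the function name rendered_text and the BR_MARK placeholder are made up for illustration, the sample cell is shortened, and its second paragraph (with the <br/>) is invented to show the line-break case. It assumes the text lives in non-nested <p> elements and that the placeholder character never occurs in the source.

```python
from bs4 import BeautifulSoup

BR_MARK = "\x00"  # hypothetical placeholder; assumed not to occur in the source text


def rendered_text(html):
    """Return one line per <p>, with <br> treated as a line break."""
    soup = BeautifulSoup(html, "html.parser")
    # Swap each <br> for a marker so it survives the whitespace collapse below.
    for br in soup.find_all("br"):
        br.replace_with(BR_MARK)
    lines = []
    for p in soup.find_all("p"):
        # No strip=True here: it strips every text node separately and
        # so removes the spaces around inline tags such as <b>.
        for piece in p.get_text().split(BR_MARK):
            lines.append(" ".join(piece.split()))
    return "\n".join(lines)


cell = """<td valign="top" width="274">
  <p class="TableText10pts"><span style="FONT-SIZE: 11pt; COLOR: black">Motor 
  stuck - the motor is powered but is not moving according to the definition 
  of <b>CL[2]</b> and <b>CL[3].</b></span><span style="FONT-SIZE: 11pt"></span></p>
  <p>first line<br/>second 
  line</p></td>"""

print(rendered_text(cell))
```

The spaces around the <b> tags survive because the whitespace is collapsed only after the whole paragraph's text has been joined, rather than node by node.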
