python html beautifulsoup text-extraction

Extract text from HTML, handling whitespace and <p> and <br> tags like a browser


I am trying to extract text from an XHTML table, as plain text, but preserving the line breaks that would appear if the document were rendered in an HTML renderer. I don't want to preserve the line breaks in the actual raw XML file.

The raw table cells contain lots of superfluous whitespace that HTML browsers don't render, and also contain <p></p> and <br /> tags (which obviously are rendered).

Here is an example of the type of cell the source document contains:

<td>
  <p class="TableText10pts"><span style="FONT-SIZE: 11pt; COLOR: black">INTERPOLATION QUEUE FULL 
  </span><span style="FONT-SIZE: 11pt"></span></p>
  <p class="TableText10pts"><span style="FONT-SIZE: 11pt; COLOR: black">In 
  Interpolated position motion mode the set-point buffer is full. The last 
  received set-point is not interpolated.</span><span style="FONT-SIZE: 11pt"></span></p></td>

The extracted text for this cell should look like this:

INTERPOLATION QUEUE FULL
In Interpolated position motion mode the set-point buffer is full. The last received set-point is not interpolated.

Or like this (with an extra new line between the paragraphs):

INTERPOLATION QUEUE FULL

In Interpolated position motion mode the set-point buffer is full. The last received set-point is not interpolated.

When I use BeautifulSoup's .get_text(separator=' ', strip=True) method, whitespace inside a text node that a browser would not render is preserved in the output, like this:

INTERPOLATION QUEUE FULL In \n      Interpolated position motion mode the set-point buffer is full. The last \n      received set-point is not interpolated.

When I use the more-sophisticated BeautifulSoup-based answer from this question, much of the unwanted whitespace disappears but the non-rendered linebreaks are still present, e.g. between "In" and "Interpolated".

When I use Html2Text in its default settings, the non-rendered whitespace is stripped like I want, but the <p> and <br /> tags present in the underlying HTML are ignored, and it injects additional line breaks that are not present in the HTML paragraphs.

Code snippet of my Html2Text usage:

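import xml.etree.ElementTree as ET  # assumed; the original snippet does not show its imports

import html2text

h2t = html2text.HTML2Text()
h2t.ignore_emphasis = True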

def element2html(element):
    return ET.tostring(element, encoding='unicode', method='xml')

def get_text(element):
    html = element2html(element)
    return h2t.handle(html).strip()

Example output from code above:

INTERPOLATION QUEUE FULL  In Interpolated position motion mode the set-point\nbuffer is full. The last received set-point is not interpolated.

I can suppress the line-break insertion by configuring the Html2Text converter with body_width=0:

h2t = html2text.HTML2Text()
h2t.body_width=0
h2t.ignore_emphasis=True
[...]

But it is still discarding the <p> and <br /> layout information from the original HTML. Sample output:

INTERPOLATION QUEUE FULL  In Interpolated position motion mode the set-point buffer is full. The last received set-point is not interpolated.

How can I extract the text with whitespace handled the way a browser would?

UPDATE: Here is another verbatim example of sample XHTML from the source document. (This time I did not elide the formatting attributes on the <td> tag).

<td style="BORDER-TOP: medium none; HEIGHT: 13.5pt; BORDER-RIGHT: red 1pt solid; WIDTH: 205.55pt; BACKGROUND: white; BORDER-BOTTOM: red 1pt solid; PADDING-BOTTOM: 0pt; PADDING-TOP: 0pt; PADDING-LEFT: 5.4pt; BORDER-LEFT: red 1pt solid; PADDING-RIGHT: 5.4pt" valign="top" width="274">
  <p class="TableText10pts"><span style="FONT-SIZE: 11pt; COLOR: black">Motor 
  stuck - the motor is powered but is not moving according to the definition 
  of <b>CL[2]</b> and <b>CL[3].</b></span><span style="FONT-SIZE: 11pt"></span></p></td>

I would like the extracted text to be like this (no line breaks):

Motor stuck - the motor is powered but is not moving according to the definition of CL[2] and CL[3].

It's perfectly fine for the <b> tags to be stripped from the output, but running text = [" ".join(p.getText(strip=True).replace("\n", "").split()) for p in soup] on this input also deletes the whitespace around the <b> tags.

So the actual output looks like:

Motor stuck - the motor is powered but is not moving according to the definition ofCL[2]andCL[3].
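
The difference can be reproduced in isolation. The sketch below uses a shortened, made-up fragment of the cell above (the variable names are illustrative only). With strip=True, each text node is stripped on its own and the pieces are then joined with the default empty separator, so the spaces next to the <b> tags are lost. Collapsing whitespace only after the text has been joined keeps them.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the real cell; only the wrapped text and <b> tags matter here.
html = "<p>definition \n  of <b>CL[2]</b> and <b>CL[3].</b></p>"
p = BeautifulSoup(html, "html.parser").p

# strip=True strips every text node separately, then joins them with "".
glued = " ".join(p.get_text(strip=True).split())

# Join the text nodes first, then collapse all whitespace runs to single spaces.
spaced = " ".join(p.get_text().split())

print(glued)
print(spaced)
```

The first line printed has the words run together around the bold text, while the second keeps a single space on each side.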

Solution

  • I'm not sure whether the HTML in your source document really wraps lines inside the <p> elements as shown in the question, so this answer is an educated guess, but you could try something as simple as this:

    from bs4 import BeautifulSoup
    
    sample_html = """<td>
      <p class="TableText10pts"><span style="FONT-SIZE: 11pt; COLOR: black">INTERPOLATION QUEUE FULL 
      </span><span style="FONT-SIZE: 11pt"></span></p>
      <p class="TableText10pts"><span style="FONT-SIZE: 11pt; COLOR: black">In Interpolated position motion mode the set-point buffer is full. The last received set-point is not interpolated.</span><span style="FONT-SIZE: 11pt"></span></p></td>"""
    
    # get_text() returns a plain string, so name the result accordingly.
    text = BeautifulSoup(sample_html, 'html.parser').get_text(separator='\n', strip=True)
    print(text)
    

    This should print:

    INTERPOLATION QUEUE FULL
    In Interpolated position motion mode the set-point buffer is full. The last received set-point is not interpolated.
    

    However, if the source really is wrapped the way the question shows it, then, IMHO, you still don't need any fancy modules.

    For example, this:

    from bs4 import BeautifulSoup
    
    sample_html = """<td>
      <p class="TableText10pts"><span style="FONT-SIZE: 11pt; COLOR: black">INTERPOLATION QUEUE FULL 
      </span><span style="FONT-SIZE: 11pt"></span></p>
      <p class="TableText10pts"><span style="FONT-SIZE: 11pt; COLOR: black">In 
      Interpolated position motion mode the set-point buffer is full. The last 
      received set-point is not interpolated.</span><span style="FONT-SIZE: 11pt"></span></p></td>"""
    
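    paragraphs = BeautifulSoup(sample_html, 'html.parser').find_all("p")
    # str.split() already treats newlines as whitespace, so no replace("\n", "") is needed
    # (removing newlines first could glue two words together when a line has no trailing space).
    text = [" ".join(p.get_text().split()) for p in paragraphs]
    print("\n".join(text))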
    

    Gives this:

    INTERPOLATION QUEUE FULL
    In Interpolated position motion mode the set-point buffer is full. The last received set-point is not interpolated.
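
The snippets above do not deal with <br /> tags, which the question also mentions. As a rough sketch of one way to cover them, the example below swaps BeautifulSoup for the standard library's html.parser. It starts a new line at every <br> and at the boundaries of a small, assumed set of block-level tags, and it collapses whitespace only after the fragments of a line have been joined. The names RenderedTextParser, rendered_text and BLOCK_TAGS are made up for this illustration, and the sample cell is a trimmed variant of the one in the question with an extra <br /> paragraph added. This is an approximation of browser behaviour, not a full implementation of CSS whitespace rules.

```python
from html.parser import HTMLParser

# Assumption: only these tags start a new line; extend the set for other documents.
BLOCK_TAGS = {"p", "div", "td", "th", "tr", "li"}


class RenderedTextParser(HTMLParser):
    """Collect text roughly the way a browser lays it out (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.lines = []  # finished output lines
        self.parts = []  # raw text fragments of the line being built

    def _flush(self, keep_empty=False):
        # Join the fragments first, then collapse each whitespace run to one space,
        # so spaces next to inline tags such as <b> survive.
        line = " ".join("".join(self.parts).split())
        if line or keep_empty:
            self.lines.append(line)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self._flush(keep_empty=True)  # <br> always ends the current line
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        self.parts.append(data)


def rendered_text(html):
    parser = RenderedTextParser()
    parser.feed(html)
    parser.close()
    parser._flush()  # emit any text left after the last tag
    return "\n".join(parser.lines)


cell = """<td>
  <p class="TableText10pts"><span>Motor 
  stuck - the motor is powered but is not moving according to the definition 
  of <b>CL[2]</b> and <b>CL[3].</b></span><span></span></p>
  <p>first line<br />second
  line</p></td>"""

print(rendered_text(cell))
```

For this input the sketch prints three lines: the "Motor stuck" sentence on one line with the spaces around CL[2] and CL[3]. intact, then "first line" and "second line". To get a blank line between paragraphs instead, the lines could be joined per paragraph with "\n\n".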