Tags: python, lxml, pyquery

Stop pyquery inserting spaces where there aren't any in source HTML?


I am trying to extract the text of an element using pyquery 1.2. The source HTML has no space before the colon or the period, but pyquery inserts one in each place.

Here is my code:

from pyquery import PyQuery as pq
html = '<h1><span class="highlight" style="background-color:">Randomized</span> and <span class="highlight" style="background-color:">non-randomized</span> <span class="highlight" style="background-color:">patients</span> in <span class="highlight" style="background-color:">clinical</span> <span class="highlight" style="background-color:">trials</span>: <span class="highlight" style="background-color:">experiences</span> with <span class="highlight" style="background-color:">comprehensive</span> <span class="highlight" style="background-color:">cohort</span> <span class="highlight" style="background-color:">studies</span>.</h1>'
doc = pq(html)
print(doc('h1').text())

This produces (note spaces before colon and period):

Randomized and non-randomized patients in clinical trials : experiences with comprehensive cohort studies .

How can I stop pyquery inserting spaces into the text?


Solution

  • After reading PyQuery's source I found that the text() method returns the following:

    return ' '.join([t.strip() for t in text if t.strip()])
    

    This means that the non-empty text chunks of an element are always joined with a single space, regardless of the whitespace in the source. The textual representation of HTML is not well defined, so I don't think this can be considered a bug, especially since the example in the text() documentation does exactly this:

    >>> doc = PyQuery('<div><span>toto</span><span>tata</span></div>')
    >>> print(doc.text())
    toto tata
    

    If you want different behavior, implement your own version of text(). The original is only 10 lines or so, so it makes a good starting point.
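
A minimal sketch of such a replacement, under stated assumptions: `raw_text` and `spaced_text` are hypothetical helper names, not part of pyquery. `spaced_text` only approximates the join quoted above so the two results can be compared. To keep the example runnable without pyquery installed, the standard library's `xml.etree.ElementTree` stands in for the parsed document, and the heading is a shortened version of the one in the question.

```python
import xml.etree.ElementTree as ET


def raw_text(elements):
    """Concatenate every text node exactly as it appears in the source."""
    return "".join("".join(el.itertext()) for el in elements)


def spaced_text(elements):
    """Approximate the join quoted above (comparison only, not pyquery's code)."""
    chunks = [t for el in elements for t in el.itertext()]
    return " ".join([t.strip() for t in chunks if t.strip()])


# Shortened, hypothetical version of the heading from the question.
h1 = ET.fromstring(
    "<h1><span>clinical</span> <span>trials</span>: "
    "<span>experiences</span> with <span>studies</span>.</h1>"
)

print(spaced_text([h1]))  # clinical trials : experiences with studies .
print(raw_text([h1]))     # clinical trials: experiences with studies.
```

With pyquery itself, the equivalent call would presumably be `raw_text(doc('h1'))`, assuming the selection iterates over lxml elements, which also provide `itertext()`. Unlike text(), this keeps the source whitespace verbatim, including any newlines and indentation, so you may want to collapse runs of whitespace afterwards.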