I am trying to get the text of an element using pyquery 1.2. In the displayed text there is no space before the colon or the period, but pyquery inserts one.
Here is my code:
from pyquery import PyQuery as pq
html = '<h1><span class="highlight" style="background-color:">Randomized</span> and <span class="highlight" style="background-color:">non-randomized</span> <span class="highlight" style="background-color:">patients</span> in <span class="highlight" style="background-color:">clinical</span> <span class="highlight" style="background-color:">trials</span>: <span class="highlight" style="background-color:">experiences</span> with <span class="highlight" style="background-color:">comprehensive</span> <span class="highlight" style="background-color:">cohort</span> <span class="highlight" style="background-color:">studies</span>.</h1>'
doc = pq(html)
print doc('h1').text()
This produces (note spaces before colon and period):
Randomized and non-randomized patients in clinical trials :
experiences with comprehensive cohort studies .
How can I stop pyquery inserting spaces into the text?
After reading PyQuery's source, I found that the text() method returns the following:
return ' '.join([t.strip() for t in text if t.strip()])
This means that the text of non-empty tags is always separated by a single space, even where the markup has no whitespace between them. I guess the problem is that the textual representation of HTML is not well-defined, so I don't think this can be considered a bug, especially since the example in the text() documentation does exactly this:
>>> doc = PyQuery('<div><span>toto</span><span>tata</span></div>')
>>> print(doc.text())
toto tata
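To see why the space ends up before the colon and the period, you can apply the same expression by hand. This is only a sketch: the fragments list below is written out manually from the end of the heading in the question and is not taken from pyquery's internals.

```python
# Text fragments as they appear in the markup: span contents and the
# text that follows each span (hand-written for illustration).
fragments = ['trials', ': ', 'experiences', ' with ', 'studies', '.']

# The same expression text() returns: strip every fragment, drop the
# empty ones, and join what is left with a single space.
joined = ' '.join([t.strip() for t in fragments if t.strip()])

print(joined)  # trials : experiences with studies .
```

Because ': ' is stripped to ':' and then joined with a space on both sides, the original spacing is lost.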
If you want different behavior, try implementing your own version of text(). You can use the original version for inspiration, since it is only 10 lines or so.
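A minimal sketch of one such replacement, assuming each element in the selection supports itertext() (lxml elements, which pyquery wraps, do, and so do the standard library's ElementTree elements). The name plain_text is hypothetical, and the demo uses ElementTree with a shortened heading only so that it runs without extra dependencies.

```python
import xml.etree.ElementTree as ET


def plain_text(elements):
    """Concatenate the text of each element exactly as written in the
    markup, without stripping fragments or adding separators."""
    return ''.join(''.join(el.itertext()) for el in elements)


# Shortened version of the heading from the question.
h1 = ET.fromstring(
    '<h1><span>trials</span>: <span>experiences</span> '
    'with <span>studies</span>.</h1>'
)

print(plain_text([h1]))  # trials: experiences with studies.
```

Since a PyQuery selection iterates over its underlying elements, the same function should work as plain_text(doc('h1')). Note that this keeps whitespace exactly as it is in the source, so adjacent tags such as the two spans in the toto/tata example come out as "tototata".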