How do I get all the text inside a tag, i.e. "Digital Business Designer (m/w/d)",
from a tag like this:
<a class="title">Digital Business Designer (m/<wbr>w/<wbr>d)</a>
I have tried the code below, but it returns only "Digital Business Designer (m/".
async def parse(self, response):
    programs = response.css('#programslist')
    for program in programs.css('.title'):
        title = program.css('::text').get()
        title = re.sub(r'<wbr>', '', title)
        yield {'title': title}
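The <wbr> elements split the link text into several text nodes, and get() returns only the first of them, so the re.sub() call never sees the rest. You can use the XPath //text() expression together with getall() to get the text of an element and all its descendants as a list. Then use ''.join() to combine the pieces back into a single string.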
For example:
from scrapy.http.response.text import TextResponse

def parse(response):
    # //text() selects every text node under the <a>, including those after each <wbr>
    lst = response.xpath("//a[@class='title']//text()").getall()
    text = "".join(lst)
    print(text)

doc = """
<html>
<body>
<a class="title">Digital Business Designer (m/<wbr>w/<wbr>d)</a>
</body>
</html>
""".encode("utf8")

# Build a response from the sample HTML so parse() can be run without a crawl
response = TextResponse("url", body=doc)
parse(response)
OUTPUT
Digital Business Designer (m/w/d)