Tags: python, web-scraping, scrapy

Get text separated by <wbr> from an anchor tag in Scrapy


How do I get all the text inside the tag, i.e. "Digital Business Designer (m/w/d)", from a tag like this?

<a class="title">Digital Business Designer (m/<wbr>w/<wbr>d)</a>

I have tried the code below, but it returns only "Digital Business Designer (m/".

    async def parse(self, response):
        programs = response.css('#programslist')
        for program in programs.css('.title'):
            title = program.css('::text').get()
            title = re.sub(r'<wbr>', '', title)
            yield {'title': title}

Solution

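  • Your attempt returns only the first fragment because get() returns just the first matching text node, and each <wbr> splits the anchor's text into separate nodes. The re.sub call has nothing to remove, since ::text never includes tags. Instead, you can use the XPath expression //text() to select the text nodes of an element and all its descendants. Combined with getall(), this returns them as a list, and ''.join() merges them back into a single string.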

    For example:

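    from scrapy.http.response.text import TextResponse

    def parse(response):
        # getall() returns every text node under the anchor,
        # one per fragment between the <wbr> tags
        lst = response.xpath("//a[@class='title']//text()").getall()
        # join the fragments back into a single string
        text = "".join(lst)
        print(text)

    doc = """
    <html>
        <body>
            <a class="title">Digital Business Designer (m/<wbr>w/<wbr>d)</a>
        </body>
    </html>
    """.encode("utf8")

    # build a response by hand so the example runs without a crawl
    response = TextResponse("url", body=doc)
    parse(response)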
    

    OUTPUT

    Digital Business Designer (m/w/d)