How to extract the link text from similar <a href> elements with Scrapy


I want to extract the text of the links below. For the subtheme and year I have:

SUBTHEME_SELECTOR = '.subtheme::text'
YEAR_SELECTOR = '.year::text'
But I am not sure how to extract the theme. Can you help me?
THEME_SELECTOR = '//a[contains(@href, "/sets/theme-")]/@href' ???

<div class='tags floatleft'>
    <a href='/sets/10251-1/Brick-Bank'>10251-1</a> 
    <a href='/sets/theme-Creator-Expert'>Creator Expert</a> 
    <a class='subtheme' href='/sets/theme-Creator-Expert/subtheme-Modular-Buildings'>Modular Buildings</a> 
    <a class='year' href='/sets/theme-Creator-Expert/year-2016'>2016</a> 
</div>

Solution

  • Your XPath is right. It ends in /@href, so it returns the link URL; end it with /text() to get the link text instead. You can test both quite simply without actually scraping the site:

    import scrapy
    
    TEXT = """
    <div class='tags floatleft'>
        <a href='/sets/10251-1/Brick-Bank'>10251-1</a> 
        <a href='/sets/theme-Creator-Expert'>Creator Expert</a> 
        <a class='subtheme' href='/sets/theme-Creator-Expert/subtheme-Modular-Buildings'>Modular Buildings</a> 
        <a class='year' href='/sets/theme-Creator-Expert/year-2016'>2016</a> 
    </div>
    """
    
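    # Build a selector straight from the HTML string; no request is made.
    s = scrapy.Selector(text=TEXT)
    # The predicate matches three links (theme, subtheme, year);
    # extract_first() returns the first in document order, the theme link.
    link = s.xpath('//a[contains(@href,"/sets/theme-")]/@href').extract_first()
    text = s.xpath('//a[contains(@href,"/sets/theme-")]/text()').extract_first()
    print(link)
    print(text)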
    

    Produces:

    /sets/theme-Creator-Expert
    Creator Expert