How to get the text of an <a> tag that contains a specific URL

I have a question that I cannot answer myself, and it might be an interesting one. I am crawling for a link like this:

    <a href="">Prosta delovna mesta  v Sandozu</a>

and now that I have found it I would also like to have the text of the tag: "Prosta delovna mesta v Sandozu"

How do I get the text? It seems easy with a plain string, and this would be the solution:


but I am in a loop and I only have a variable that refers to this URL. I tried several options, like:


    word = "career"
    response.xpath('//a[contains(@href, "%s")]/text()').extract() % word

But none of them work. I am looking for a way to put a variable, instead of a string literal, into '@href' or the 'contains' function. Here is my code. Do you think there is a way to do it?

Thank you, Marko

def parse(self, response):


    #We take all urls; they are marked by "href". These are either pages on our website or links to other websites.
    urls = response.xpath('//@href').extract()

    #Base url.
    base_url = get_base_url(response) 

    #Loop through all urls on the webpage.
    for url in urls:

        #If url represents a picture, a document, a compression ... we ignore it. We might have to change that because some companies provide job vacancies information in PDF.
        if url.endswith((
            '.jpg', '.jpeg', '.png', '.gif', '.eps', '.ico', 
            '.JPG', '.JPEG', '.PNG', '.GIF', '.EPS', '.ICO', 

            '.xls', '.ppt', '.doc', '.xlsx', '.pptx', '.docx', '.txt', '.csv', '.pdf', 
            '.XLS', '.PPT', '.DOC', '.XLSX', '.PPTX', '.DOCX', '.TXT', '.CSV', '.PDF', 

            #music and video
            '.mp3', '.mp4', '.mpg', '.ai', '.avi',
            '.MP3', '.MP4', '.MPG', '.AI', '.AVI',

            #compressions and other
            '.zip', '.rar', '.css', '.flv',
            '.ZIP', '.RAR', '.CSS', '.FLV',
        )):
            continue

        #If url includes characters like ?, %, &, # ... it is LIKELY NOT to be the one we are looking for and we ignore it. 
        #However in this case we exclude good urls like
        if any(x in url for x in ['?', '%', '&', '#']):
            continue

        #Ignore ftp.
        if url.startswith("ftp"):
            continue

        #We remember the href exactly as it appears in the page, so that we can look up its <a> tag later.
        url_orig = url

        #If url doesn't start with "http", it is a relative url, and we join it with the base url to get an absolute url.
        # -- It is true that we may get some strange urls, but it is fine for now.
        if not url.startswith("http"):
            url = urljoin(base_url, url)

        #We don't want to go to other websites. We want to stay on our website, so we keep only urls with domain (netloc) of the company we are investigating.         
        if (urlparse(url).netloc == urlparse(base_url).netloc):

            #The main part. We look for webpages, whose urls include one of the employment words as strings.

            # -- Instruction. 
            # -- Users in other languages, please insert employment words in your own language, like jobs, vacancies, career, employment ... --
            if any(x in url for x in [









                #We found url that includes one of the magic words. We check, if we have found it before. If it is new, we add it to the list "jobs_urls".
                if url not in self.jobs_urls:
                    item = JobItem()
                    item["link"] = url
                    #item["term"] = response.xpath('//a[@href=url_orig]/text()').extract() 
                    #item["term"] = response.xpath('//a[contains(@href, "career")]/text()').extract()

                    #We return the item.
                    yield item

            #We don't put "else" sentence because we want to explore the employment webpage to find possible new employment webpages.
            #We keep looking for employment webpages, until we reach the DEPTH, that we have set in 
            yield Request(url, callback = self.parse)


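  • The XPath expression is just a string, so the Python variable has to be formatted into it before the string is passed to xpath(). Inside the expression the value must be quoted; without quotes, XPath reads url_orig as an element name rather than as a string. In your first attempt the % is also applied to the list returned by extract(), not to the expression. So put the url in quotes and apply the string formatting to the expression itself: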

    item["term"] = response.xpath('//a[@href="%s"]/text()' % url_orig).extract()
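
    The same idea can be tried outside a spider. The following is only a minimal sketch: it uses the standard library's ElementTree as a stand-in for Scrapy's selector (ElementTree supports just a subset of XPath and has no text() step), and the page fragment, the href values and the helper name link_texts are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment; the href values are invented for this sketch.
PAGE = """
<html><body>
  <a href="/about">About us</a>
  <a href="/career/jobs">Prosta delovna mesta v Sandozu</a>
</body></html>
"""


def link_texts(root, href):
    """Return the text of every <a> element whose href equals `href` exactly."""
    # Build the complete expression first, then hand it to the XPath engine.
    # The value is wrapped in double quotes so it is treated as a string literal.
    # Note: this simple quoting breaks if `href` itself contains a double quote.
    expression = './/a[@href="%s"]' % href
    # ElementTree has no text() step, so read .text from each matching element.
    return [a.text for a in root.findall(expression)]


root = ET.fromstring(PAGE)
url_orig = "/career/jobs"  # the href as it appears in the page, before urljoin
print(link_texts(root, url_orig))
```

    Note that the lookup has to use the href exactly as it is written in the page (url_orig), not the absolute url produced by urljoin, otherwise nothing matches. Scrapy's selectors should also accept XPath variables, e.g. response.xpath('//a[@href=$u]/text()', u=url_orig), which avoids the quoting problem; treat that as something to check against the Scrapy version you run.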