Extracting URL from anchor that has a data-encoded-url


I'm trying to extract the "Website" link on the page

https://www.tripadvisor.com.sg/Restaurant_Review-g294265-d17171783-Reviews-Fu_Lin_Men_NSRCC-Singapore.html

When I view the HTML in my browser console it is

<a data-encoded-url="aVZVX2h0dHA6Ly93d3cuZnVsaW5tZW4uY29tLnNnL2Z1LWxpbi1tZW4tbnNyY2NfVFJS" class="_2wKz--mA _15QfMZ2L" target="_blank" href="http://www.fulinmen.com.sg/fu-lin-men-nsrcc">Website  ... </a>

When I request this element in scrapy shell using

response.css('a:contains("Website")').get(),

I get

 ('<a data-encoded-url="QTh2X2h0dHA6Ly93d3cuZnVsaW5tZW4uY29tLnNnL2Z1LWxpbi1tZW4tbnNyY2NfT0ha" class="_2wKz--mA _15QfMZ2L" target="_blank">Website ... </a>',)

Which does not have a href attribute!

It seems that the browser turns the data-encoded-url into a href but scrapy does not.

I can extract the data-encoded-url value, but I can't find any information on converting it to a URL.


Solution

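  • The data-encoded-url value is Base64. Decoding the one from the browser gives iVU_http://www.fulinmen.com.sg/fu-lin-men-nsrcc_TRR, i.e. the URL wrapped in a short prefix and suffix separated by underscores. (The value Scrapy received decodes to the same URL with a different prefix and suffix, A8v_ and _OHZ.) In JavaScript:

    atob("aVZVX2h0dHA6Ly93d3cuZnVsaW5tZW4uY29tLnNnL2Z1LWxpbi1tZW4tbnNyY2NfVFJS").replace(/^.*_(.*)_.*$/, "$1")

    gives http://www.fulinmen.com.sg/fu-lin-men-nsrcc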
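
Since the question uses Scrapy, here is a rough Python equivalent of the same idea. It is only a sketch based on the two samples above: the helper name decode_encoded_url is made up for illustration, and it assumes the wrapper is always "prefix_URL_suffix" with no underscores inside the prefix or suffix. Unlike the greedy regex, it cuts at the first and last underscore, so a URL that itself contains underscores should survive.

```python
import base64


def decode_encoded_url(encoded: str) -> str:
    """Decode a data-encoded-url value (hypothetical helper, not a Scrapy API)."""
    # Restore '=' padding in case the attribute was stored without it.
    padded = encoded + "=" * (-len(encoded) % 4)
    decoded = base64.b64decode(padded).decode("utf-8")
    # Assumed layout: "<prefix>_<url>_<suffix>".
    # Drop everything up to the first underscore...
    _, _, rest = decoded.partition("_")
    # ...and everything after the last underscore.
    url, _, _ = rest.rpartition("_")
    return url


if __name__ == "__main__":
    sample = "aVZVX2h0dHA6Ly93d3cuZnVsaW5tZW4uY29tLnNnL2Z1LWxpbi1tZW4tbnNyY2NfVFJS"
    print(decode_encoded_url(sample))
```

In the Scrapy shell, the attribute itself can probably be read with something like response.css('a:contains("Website")').attrib.get("data-encoded-url") and then passed to the helper.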