Tags: html, url, web-scraping, hyperlink, href

URL in HTML and URL for desired link are not the same


I am mining links from a Chinese academic article database.

It appears that when I refresh an article's page, or simply copy and paste its URL, the browser is redirected to the database's home page rather than to the article.

For example, the following link goes to my search results: http://search.cnki.net/search.aspx?q=%E4%BA%BA%E5%B7%A5%E6%99%BA%E8%83%BD

The first article's individual url is: http://www.cnki.net/kcms/detail/detail.aspx?dbcode=CJFQ&dbName=CJFQ2016&FileName=KJDB201615009&v=&uid=

However, if you try to open the article link directly or refresh the article page, it redirects to the database home page. Why is this happening? Is there any way to get a "stable" URL for these articles?

One detail that may matter, although I'm not sure, is that the URL in the HTML source for the individual article is also different:

<a href="http://epub.cnki.net/grid2008/brief/detailj.aspx?filename=KJDB201615009&amp;dbname=CJFDLAST2016" target="_blank">

Solution

  • It's not really up to you. The website you are referring to checks whether the link was opened directly or from another page on the same website. This is probably to prevent other websites from embedding links to its articles. In short, it does not allow direct links to its articles. You can see this by examining the response headers returned for the request.

    Instead of 200 OK you get 302, which tells the browser to redirect to another location. You can try to fool the website by adding a "Referer" header to your request.

    If you look at the headers of the request that works, you'll see that there is one. I did not try it, but I'm pretty sure it will work.
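The check and the workaround described above could be sketched in Python roughly as follows. This is a minimal, untested sketch using only the standard library. The helper names `build_request`, `fetch_status`, and `NoRedirect` are made up for illustration, and using the search results URL as the `Referer` value is an assumption; the site may expect a different referring page, or may also rely on cookies.

```python
import urllib.error
import urllib.request

# URLs taken from the question.
SEARCH_URL = "http://search.cnki.net/search.aspx?q=%E4%BA%BA%E5%B7%A5%E6%99%BA%E8%83%BD"
ARTICLE_URL = (
    "http://www.cnki.net/kcms/detail/detail.aspx"
    "?dbcode=CJFQ&dbName=CJFQ2016&FileName=KJDB201615009&v=&uid="
)


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Hypothetical helper: refuse to follow redirects so a 302 stays visible."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None makes urllib raise HTTPError instead of following.
        return None


def build_request(url, referer=None):
    """Build a GET request, optionally carrying a Referer header."""
    headers = {}
    if referer is not None:
        # Assumption: the site only checks that the Referer is one of its pages.
        headers["Referer"] = referer
    return urllib.request.Request(url, headers=headers)


def fetch_status(request, timeout=10):
    """Return (status code, Location header) without following redirects."""
    opener = urllib.request.build_opener(NoRedirect)
    try:
        with opener.open(request, timeout=timeout) as response:
            return response.status, response.headers.get("Location")
    except urllib.error.HTTPError as err:
        # With NoRedirect installed, 3xx responses end up here.
        return err.code, err.headers.get("Location")


if __name__ == "__main__":
    # Compare the direct request with one that claims to come from the search page.
    print("direct:      ", fetch_status(build_request(ARTICLE_URL)))
    print("with Referer:", fetch_status(build_request(ARTICLE_URL, referer=SEARCH_URL)))
```

If the answer's guess is right, the first call should report a 302 with a `Location` pointing at the home page and the second a 200, but that depends on how the site actually performs its check.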