python-3.x, html-parsing, bioinformatics

How to determine the article PDF download link from the article webpage?


I would like to automatically download the articles on my DOI list (about 1500). Using doi.org, I can get the web page content for each of them. The problem is that every website is different, and I do not know how to pick out the download link from among the many hrefs on a page. Could you suggest anything useful for this in Python?

P.S. I am only talking about free-access articles, so I can be sure that the link exists.


Solution

  • As it turned out, the most convenient way is to use the metapub library. Note that it requires Visual Studio C++ 2015 or later.

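    import time
    from urllib.request import urlretrieve

    import metapub

    def downloadByDOI(doi, handle, retries=3):
        # handle is the destination file path passed to urlretrieve
        src = metapub.FindIt(doi=doi)
        if src.url is None:
            # FindIt found no downloadable PDF; src.reason explains why
            raise ValueError(f"No PDF link for {doi}: {src.reason}")
        for attempt in range(retries):
            try:
                urlretrieve(src.url, handle)
                return
            except OSError:
                # Network errors (URLError is an OSError): retry a limited
                # number of times instead of recursing forever
                if attempt == retries - 1:
                    raise
                time.sleep(1)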
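
Since the question is about a list of roughly 1500 DOIs, a small driver loop around the function above may be useful. The following is only a sketch under my own assumptions: the names `download_all`, `doi_to_filename` and the `fetch` parameter are hypothetical and not part of metapub. `fetch` is any callable taking `(doi, path)`, for example `downloadByDOI` from the snippet above. DOIs contain `/`, so they are turned into safe file names first, files that already exist are skipped so an interrupted run can be resumed, and failures are collected instead of stopping the whole batch.

```python
import re
from pathlib import Path


def doi_to_filename(doi):
    # DOIs contain "/" and other characters that are unsafe in file names,
    # so replace everything except letters, digits, ".", "_" and "-".
    return re.sub(r"[^A-Za-z0-9._-]", "_", doi) + ".pdf"


def download_all(dois, out_dir, fetch):
    """Call fetch(doi, path) for every DOI.

    Returns a dict {doi: error message} for the DOIs that failed.
    `fetch` is a hypothetical hook: any callable that saves the PDF
    for `doi` to `path` and raises an exception on failure.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    failed = {}
    for doi in dois:
        target = out / doi_to_filename(doi)
        if target.exists():
            continue  # already downloaded on a previous run
        try:
            fetch(doi, str(target))
        except Exception as exc:
            # Remember the failure and carry on with the next DOI
            failed[doi] = str(exc)
    return failed
```

Usage would look something like `failed = download_all(my_dois, "pdfs", downloadByDOI)`, after which `failed` can be inspected or retried later. With this many requests it is probably also worth pausing between downloads so as not to overload the publishers' servers.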