web-scraping, scrapy, puppeteer, tumblr, pytumblr

How to scrape/download all tumblr images with a particular tag


I am trying to download many (thousands of) images from tumblr with a particular tag (e.g. #art), and I am trying to figure out the fastest and easiest way to do this. I have considered both scrapy and puppeteer as options, and I read a little bit about the tumblr API, but I'm not sure how to use the API to locally download the images I want. Currently, puppeteer seems like the best way, but I'm not sure how to deal with the fact that tumblr uses lazy loading (e.g. what is the code for getting all the images, scrolling down, waiting for the new images to load, and getting those as well?). Would appreciate any tips!


Solution

  • My solution is below. Since the tagged endpoint doesn't accept an offset, I used the timestamp of each post as the offset instead. Since I was specifically trying to get the links of the images in the posts, I did a little processing of the output as well. I then used a simple Python script to download every image from my list of links (a sketch of such a script follows the main code below). I have also included a website and an additional Stack Overflow post which I found helpful.

    import pytumblr

    def get_all_posts(client, tag):
        # The tagged endpoint has no offset parameter, so paginate by passing
        # the timestamp of the oldest post seen so far as `before`.
        before = None

        for i in range(48):
            response = client.tagged(tag, limit=20, before=before)
            if not response:
                break

            for post in response:
                if 'photos' in post:
                    # Photo posts expose the image URL directly.
                    url = post['photos'][0]['original_size']['url']
                    print(url)
                    yield url
                elif 'body' in post:
                    # Text posts: pull the first <img src="..."> out of the HTML body.
                    chunks = [b for b in post['body'].split('<') if 'img src=' in b]
                    if chunks:
                        url = chunks[0].split('"')[1]
                        print(url)
                        yield url

            # Move the cursor to the timestamp of the last (oldest) post returned.
            before = response[-1]['timestamp']
        print(before)

    client = pytumblr.TumblrRestClient('USE YOUR API KEY HERE')

    tag = 'YOUR TAG HERE'

    # Write every image URL the generator finds to a text file, one per line.
    with open('{}-posts.txt'.format(tag), 'w') as out_file:
        for url in get_all_posts(client, tag):
            print(url, file=out_file)
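
    The download script itself isn't shown above, so here is a minimal sketch of the kind of script I mean, assuming the image URLs sit one per line in the text file written by the code above; the use of the requests library and the output folder name are just illustrative choices, not part of the original code.

    import os
    import requests

    def download_images(links_file, out_dir):
        # Create the target folder and read the list of image URLs.
        os.makedirs(out_dir, exist_ok=True)
        with open(links_file) as f:
            urls = [line.strip() for line in f if line.strip()]

        for url in urls:
            # Name each file after the last path segment of its URL.
            filename = os.path.join(out_dir, url.split('/')[-1])
            try:
                response = requests.get(url, timeout=30)
                response.raise_for_status()
            except requests.RequestException as e:
                print('Skipping {}: {}'.format(url, e))
                continue
            with open(filename, 'wb') as img:
                img.write(response.content)
            print('Saved', filename)

    # Same placeholder tag as in the scraping code above.
    download_images('YOUR TAG HERE-posts.txt', 'images')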
    

    Links:

    https://64.media.tumblr.com/9f6b4d8d15caffe88c5877cd2fb31726/8882b6bec4975045-23/s540x810/49586f5b05e8661d77e370845d01b34f0f5f2ca6.png

    Print more than 20 posts from Tumblr API

    Also thank you very much to Harada, whose advice helped a lot!