web-scraping, scrapy, puppeteer, tumblr, pytumblr

How to scrape/download all tumblr images with a particular tag


I am trying to download many (thousands of) images from tumblr with a particular tag (e.g. #art), and I am trying to figure out the fastest and easiest way to do this. I have considered both scrapy and puppeteer as options, and I read a little bit about the tumblr API, but I'm not sure how to use the API to locally download the images I want. Currently, puppeteer seems like the best way, but I'm not sure how to deal with the fact that tumblr uses lazy loading (e.g. what is the code for getting all the images, scrolling down, waiting for the new images to load, and getting those as well?). Would appreciate any tips!


Solution

  • My solution is below. Since the tagged endpoint doesn't accept an offset, I used the timestamp of each post as the offset instead. Since I was specifically trying to get the links of the images in the posts, I did a little processing of the output as well. I then used a simple Python script to download every image from my list of links (a sketch of such a script follows the main code below). I have also included a website and an additional Stack Overflow post which I found helpful.

    import pytumblr

    def get_all_posts(client, tag):
        # The tagged endpoint has no offset parameter, so paginate by passing
        # the timestamp of the oldest post seen so far as `before`.
        before = None

        for i in range(48):
            response = client.tagged(tag, limit=20, before=before)
            if not response:
                break

            for post in response:
                if 'photos' in post:
                    # Photo posts expose the image URL directly.
                    url = post['photos'][0]['original_size']['url']
                    print(url)
                    yield url
                elif 'body' in post:
                    # Text posts: pull the first <img src="..."> out of the HTML body.
                    chunks = [b for b in post['body'].split('<') if 'img src=' in b]
                    if chunks:
                        url = chunks[0].split('"')[1]
                        print(url)
                        yield url

            # Move the cursor to the timestamp of the last (oldest) post returned.
            before = response[-1]['timestamp']
        print(before)

    client = pytumblr.TumblrRestClient('USE YOUR API KEY HERE')

    tag = 'YOUR TAG HERE'

    # Write every image URL the generator finds to a text file, one per line.
    with open('{}-posts.txt'.format(tag), 'w') as out_file:
        for url in get_all_posts(client, tag):
            print(url, file=out_file)
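
    The download script itself isn't shown above, so here is a minimal sketch of the kind of script I mean, assuming the image URLs sit one per line in the text file written by the code above; the use of the requests library and the output folder name are just illustrative choices, not part of the original code.

    import os
    import requests

    def download_images(links_file, out_dir):
        # Create the target folder and read the list of image URLs.
        os.makedirs(out_dir, exist_ok=True)
        with open(links_file) as f:
            urls = [line.strip() for line in f if line.strip()]

        for url in urls:
            # Name each file after the last path segment of its URL.
            filename = os.path.join(out_dir, url.split('/')[-1])
            try:
                response = requests.get(url, timeout=30)
                response.raise_for_status()
            except requests.RequestException as e:
                print('Skipping {}: {}'.format(url, e))
                continue
            with open(filename, 'wb') as img:
                img.write(response.content)
            print('Saved', filename)

    # Same placeholder tag as in the scraping code above.
    download_images('YOUR TAG HERE-posts.txt', 'images')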
    

    Links:

    https://64.media.tumblr.com/9f6b4d8d15caffe88c5877cd2fb31726/8882b6bec4975045-23/s540x810/49586f5b05e8661d77e370845d01b34f0f5f2ca6.png

    Print more than 20 posts from Tumblr API

    Also thank you very much to Harada, whose advice helped a lot!