
Getting additional image URLs (i.e. not just the first) from a tweet with Tweepy and Python


I am writing a script to download the images from a user's timeline (from each tweet/status). It works well, but only for the first image: it does not get the image URLs for any additional images (i.e. 2nd, 3rd, 4th etc.) in each tweet/status.

My (badly written) code:

timeline = tweepy.Cursor(api.user_timeline, tweet_mode='extended').items() 

for tweet in timeline:
        imagesfiles = [] # this is the list I want the image urls to go into
        if 'media' in tweet.entities:
            for image in tweet.entities['media']:
                file_location = str(image['media_url'])
                imagesfiles.append(file_location)
                file_location = "data/images/" + file_location.rsplit('/', 1)[-1]
                urllib.urlretrieve(image['media_url'], file_location)
        else:
            imagesfiles = "noimages"
            print "no images"

print imagesfiles # this should be a list of the urls for media in the tweet/status, but it only ever returns one url (the first image url) and never the rest in the status (i.e. 2nd, 3rd, 4th images etc).

Can anyone see an obvious reason why this would only get the first media_url from Twitter's returned data for a status with additional media entities?

The JSON from Twitter looks something like this:

u'extended_entities': {
    u'media': [
      {
        u'expanded_url': u'https://twitter.com/someone/status/84848343434888484/photo/1',
        u'display_url': u'pic.twitter.com/Dbasasasdamh6W',
        u'url': u'SHORT_URL_FROM_TWITTER',
        u'media_url_https': u'https://pbs.twimg.com/media/Dbasasasdamh6W.jpg',
        u'id_str': u'84848343434888484',
        u'sizes': {
          u'large': {
            u'h': 800,
            u'resize': u'fit',
            u'w': 800
          },
          u'small': {
            u'h': 680,
            u'resize': u'fit',
            u'w': 680
          },
          u'medium': {
            u'h': 800,
            u'resize': u'fit',
            u'w': 800
          },
          u'thumb': {
            u'h': 150,
            u'resize': u'crop',
            u'w': 150
          }
        },
        u'indices': [
          55,
          78
        ],
        u'type': u'photo',
        u'id': 84848343434888484,
        u'media_url': u'https://pbs.twimg.com/media/Dbasasasdamh6W.jpg'
      },
      {
        u'expanded_url': u'https://twitter.com/someone/status/435345345345345345/photo/1',
        u'display_url': u'pic.twitter.com/otasws6y36',
        u'url': u'SHORT_URL_FROM_TWITTER',
        u'media_url_https': u'https://pbs.twimg.com/media/DbeXj4as4fs43fO.jpg',
        u'id_str': u'435345345345345345',
        u'sizes': {
          u'large': {
            u'h': 1024,
            u'resize': u'fit',
            u'w': 1024
          },
          u'small': {
            u'h': 680,
            u'resize': u'fit',
            u'w': 680
          },
          u'medium': {
            u'h': 1024,
            u'resize': u'fit',
            u'w': 1024
          },
          u'thumb': {
            u'h': 150,
            u'resize': u'crop',
            u'w': 150
          }
        },
        u'indices': [
          55,
          78
        ],
        u'type': u'photo',
        u'id': 435345345345345345,
        u'media_url': u'http://pbs.twimg.com/media/DbeXj4as4fs43fO.jpg'
      }
    ]
}

Solution

  • After much searching I found the error. The URLs for the additional pictures are not in "entities" but in "extended_entities", so you need to iterate over that instead:

            if 'media' in tweet.entities:
                for media in tweet.extended_entities['media']:
                    print media['media_url']
    

    This then prints the media_url of each image (media) item in the tweet.

    Et voilà!
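
As a rough sketch of how the fix could be folded back into the download loop from the question, the helper below collects every media_url from a status, preferring "extended_entities" and falling back to "entities" when the former is absent. The names media_urls, local_path and fake_status are made up for illustration, the code is written for Python 3 (the question uses Python 2), and the stand-in object only mimics the two attributes of a Tweepy status that are used here.

```python
from types import SimpleNamespace


def media_urls(status):
    """Return the media_url of every media item on a status-like object."""
    # Assumption: the status exposes `entities` and, when media is attached,
    # `extended_entities`, both as plain dicts (as in the JSON above).
    extended = getattr(status, "extended_entities", None) or {}
    basic = getattr(status, "entities", None) or {}
    # Prefer extended_entities, which carries all the images; fall back to
    # entities, which per the answer above only has the first one.
    media = extended.get("media") or basic.get("media") or []
    return [item["media_url"] for item in media]


def local_path(url, directory="data/images"):
    """Build the target path the same way the question does."""
    return directory + "/" + url.rsplit("/", 1)[-1]


# Hypothetical stand-in for a status with two photos.
fake_status = SimpleNamespace(
    entities={"media": [{"media_url": "http://pbs.twimg.com/media/a.jpg"}]},
    extended_entities={"media": [
        {"media_url": "http://pbs.twimg.com/media/a.jpg"},
        {"media_url": "http://pbs.twimg.com/media/b.jpg"},
    ]},
)

for url in media_urls(fake_status):
    print(url, "->", local_path(url))
```

Inside the real loop you would call media_urls(tweet) for each tweet and pass each URL, together with local_path(url), to urllib.urlretrieve (urllib.request.urlretrieve in Python 3). Returning an empty list for tweets without media also avoids mixing a list and the string "noimages" in the same variable.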