python json twitter key tweepy

Check if JSON var has nullable key (Twitter Streaming API)


I'm downloading tweets from the Twitter Streaming API using Tweepy. I can check whether the downloaded data has top-level keys such as 'extended_tweet', but I'm struggling with a specific key nested inside another key.

def on_data(self, data):
    savingTweet = {}
    if "retweeted_status" not in data:
        dataJson = json.loads(data)
        if 'extended_tweet' in dataJson:
            savingTweet['text'] = dataJson['extended_tweet']['full_text']
        else:
            savingTweet['text'] = dataJson['text']
        if 'coordinates' in dataJson:
            if 'coordinates' in dataJson['coordinates']:
                savingTweet['coordinates'] = dataJson['coordinates']['coordinates']
        else:
            savingTweet['coordinates'] = 'null'

I'm checking 'extended_tweet' properly, but when I try to do the same with ['coordinates']['coordinates'] I get the following error:

TypeError: argument of type 'NoneType' is not iterable

Twitter documentation says that key 'coordinates' has the following structure:

"coordinates":
{
    "coordinates":
    [
        -75.14310264,
        40.05701649
    ],
    "type":"Point"
}

I managed to work around it by wrapping the problematic check in a try/except, but I don't think this is the most suitable approach to the problem. Any other ideas?


Solution

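  • So the Twitter API docs are probably lying a bit about what they return (shock horror!), and it looks like you're getting None in place of the expected data structure. The 'coordinates' key is present, so your first check passes, but its value is None, and a membership test against None raises the TypeError. You've already decided against using try/except, so I won't go over that, but here are a few other suggestions.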
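To make the failure concrete, here is a minimal sketch using a hand-written payload (hypothetical, not a captured API response) in which the 'coordinates' key exists but holds JSON null. It only illustrates why the outer check passes while the inner one fails:

```python
import json

# Hypothetical payload: the key exists, but its value is JSON null.
raw = '{"text": "hello", "coordinates": null}'
tweet = json.loads(raw)

print('coordinates' in tweet)        # True: the key is present
print(tweet['coordinates'] is None)  # True: JSON null is parsed as None

error_message = None
try:
    'coordinates' in tweet['coordinates']
except TypeError as exc:
    # A membership test against None is not supported.
    error_message = str(exc)
    print(error_message)
```

So the check that is actually needed is "is the value something other than None", not "does the key exist".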

    Using dict get() default

    There are a couple of options that occur to me. The first is to use the default argument of the dict get() method: you can provide a fallback for when the expected key does not exist, which allows you to chain multiple calls together.

    For example you can achieve most of what you are trying to do with the following:

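    # data is the parsed dict (the result of json.loads)
    return {
        'text': data.get('extended_tweet', {}).get('full_text', data['text']),
        # "or {}" covers a key that is present but None, which the get() default does not
        'coordinates': (data.get('coordinates') or {}).get('coordinates', 'null')
    }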
    

    It's not super pretty, but it does work. It's likely to be a little slower than what you are doing too.
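Chained get() calls only fall back when a key is missing; if a key is present with a value of None, the next call in the chain has nothing to work with. If you do this kind of lookup in several places, a small helper can hide that detail. The sketch below is only one way to write it: the name `dig` and the sample dicts are made up for illustration, and it treats a None value the same as a missing key.

```python
def dig(obj, *keys, default=None):
    """Follow keys through nested dicts.

    Returns default if a key is missing, if an intermediate value is not
    a dict (for example None), or if the final value is None.
    """
    for key in keys:
        if not isinstance(obj, dict) or key not in obj:
            return default
        obj = obj[key]
    return default if obj is None else obj


# Hypothetical parsed tweets, shaped like the structure quoted in the question.
with_point = {
    'text': 'hello',
    'coordinates': {'type': 'Point', 'coordinates': [-75.14310264, 40.05701649]},
}
without_point = {'text': 'hello', 'coordinates': None}

located = dig(with_point, 'coordinates', 'coordinates', default='null')
unlocated = dig(without_point, 'coordinates', 'coordinates', default='null')
```

Here `located` is the coordinate list and `unlocated` is the fallback 'null'.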

    Using JSONPath

    Another option, which is likely overkill for this situation, is to use a JSONPath library, which lets you search within data structures for items matching a query. Something like:

    from jsonpath_rw import parse
    
    matches = parse('extended_tweet.full_text').find(data)
    if matches:
        print(matches[0].value)
    

    This is going to be a lot slower than what you are doing, and for just a few fields it is overkill, but if you are doing a lot of this kind of work it could be a handy tool in the box. JSONPath can also express much more complicated paths, or very deeply nested paths where the get method might not work or would be unwieldy.

    Parse the JSON first!

    The last thing I would mention is to make sure you parse your JSON before you test for "retweeted_status". Testing against the raw string is a substring search, so if the text appears anywhere (say, inside the text of a tweet), the test will trigger.

    JSON parsing with a competent library is usually extremely fast too, so unless you are having real speed problems it's not necessarily worth worrying about.
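As a small illustration of that false positive, the sketch below builds a made-up raw message (not real API output) for an original tweet whose text happens to mention the key name, then compares the substring test with a key test on the parsed dict:

```python
import json

# Hypothetical raw message: an original tweet (not a retweet) whose text
# merely mentions the key name.
raw = json.dumps({"text": "what does retweeted_status mean?", "coordinates": None})

# Substring search over the raw JSON string: matches the tweet text.
substring_hit = "retweeted_status" in raw

# Key lookup on the parsed dict: only true for an actual top-level key.
tweet = json.loads(raw)
key_hit = "retweeted_status" in tweet

print(substring_hit, key_hit)  # True False
```

With the raw-string test this tweet would be discarded as a retweet; with the parsed test it is kept.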