Tags: python, algorithm, twitter, python-twitter

Is there a faster way to do this? (python twitter location)


I'm trying to return a dictionary that aggregates tweets by their nearest state center. I'm iterating over all the tweets, and for each tweet, I'm checking all the states to see which state is the closest.

What would be a better way to do this?

def group_tweets_by_state(tweets):
    """Return a dictionary that groups tweets by their nearest state center.

    The keys of the returned dictionary are state names, and the values are
    lists of tweets that appear closer to that state center than any other.

    tweets -- a sequence of tweet abstract data types
    """
    tweets_by_state = {}
    for tweet in tweets:
        position = tweet_location(tweet)
        min_distance, result_state = 100000, 'CA'
        for state in us_states:
            if geo_distance(position, find_state_center(us_states[state])) < min_distance:
                min_distance = geo_distance(position, find_state_center(us_states[state]))
                result_state = state
        if result_state not in tweets_by_state:
            tweets_by_state[result_state] = []
        tweets_by_state[result_state].append(tweet)
    return tweets_by_state

Solution

  • Every little improvement inside that huge for-loop yields a big performance gain when the number of tweets is extremely large. There are a few things I can think of:

    1. Call geo_distance() just once, especially when it's costly:

    distance = geo_distance(position, find_state_center(us_states[state]))
    if distance < min_distance:
        min_distance = distance
    

    rather than

    if geo_distance(position, find_state_center(us_states[state])) < min_distance:
        min_distance = geo_distance(position, find_state_center(us_states[state]))
    

    2. If positions tend to be repeated often:

    position_closest_state = {}  # cache: position -> nearest state
    tweets_by_state = {}
    for tweet in tweets:
        position = tweet_location(tweet)
        if position in position_closest_state:
            result_state = position_closest_state[position]
        else:
            min_distance, result_state = 100000, 'CA'
            for state in us_states:
                distance = geo_distance(position, find_state_center(us_states[state]))
                if distance < min_distance:
                    min_distance = distance
                    result_state = state
            position_closest_state[position] = result_state  # cache after the search
        if result_state not in tweets_by_state:
            tweets_by_state[result_state] = []
        tweets_by_state[result_state].append(tweet)
    

    So, say you have 1000 tweets from 200 distinct positions, and us_states has 50 entries: the original algorithm would call geo_distance() up to 1000*50*2 times, while now it is reduced to 200*50*1 calls.
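    If tweet_location() returns a hashable value (e.g. a latitude/longitude tuple), the same position cache can also be expressed with functools.lru_cache. A minimal sketch, using hypothetical stand-ins for the real us_states data and geo_distance() helper:

```python
from functools import lru_cache

# Hypothetical stand-ins for the assignment's us_states data and
# geo_distance() helper -- for illustration only.
state_centers = {'CA': (36.8, -119.4), 'NV': (39.3, -116.6), 'OR': (43.9, -120.6)}

def geo_distance(a, b):
    # Squared Euclidean distance; only the ordering matters for "nearest".
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

@lru_cache(maxsize=None)
def closest_state(position):
    # The full scan over states runs once per distinct position;
    # repeat calls with the same position are cache hits.
    return min(state_centers,
               key=lambda s: geo_distance(position, state_centers[s]))
```

    The cache key is the position tuple itself, so the decorator plays the same role as the position_closest_state dictionary above without any manual bookkeeping.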

    3. Reduce the number of calls to find_state_center()

    Similar to #2, it is currently called redundantly for every tweet and every state. Compute each state's center once, up front:

    state_center_dict = {}
    for state in us_states:
        state_center_dict[state] = find_state_center(us_states[state])

    position_closest_state = {}  # cache: position -> nearest state
    tweets_by_state = {}
    for tweet in tweets:
        position = tweet_location(tweet)
        if position in position_closest_state:
            result_state = position_closest_state[position]
        else:
            min_distance, result_state = 100000, 'CA'
            for state in us_states:
                distance = geo_distance(position, state_center_dict[state])
                if distance < min_distance:
                    min_distance = distance
                    result_state = state
            position_closest_state[position] = result_state  # cache after the search
        if result_state not in tweets_by_state:
            tweets_by_state[result_state] = []
        tweets_by_state[result_state].append(tweet)
    

    Now find_state_center() is called only 50 times (once per state) rather than 50*1000 times (once per state per tweet in the original code). We made another huge improvement!

    Performance achievement summary

    With #1 we roughly halve the number of geo_distance() calls. With #2 we cut the nearest-state searches by a factor of (number of tweets / number of distinct positions). #3 is the biggest of the three: find_state_center() runs once per state instead of once per state per tweet, reducing those calls by a factor of the number of tweets compared with the original code.
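Putting all three ideas together, the whole function might look like the sketch below. tweet_location(), geo_distance(), find_state_center(), and us_states belong to the original assignment, so minimal stand-ins are defined here only to make the sketch self-contained and runnable:

```python
# Stand-ins for the assignment's data and helpers -- illustration only.
us_states = {'CA': (36.8, -119.4), 'NV': (39.3, -116.6)}

def find_state_center(shape):
    # The real helper derives a centroid from the state's polygon;
    # this stub stores the center directly.
    return shape

def geo_distance(a, b):
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

def tweet_location(tweet):
    return tweet['loc']

def group_tweets_by_state(tweets):
    # #3: compute every state center exactly once, up front
    centers = {name: find_state_center(shape) for name, shape in us_states.items()}
    closest = {}  # #2: cache position -> nearest state
    tweets_by_state = {}
    for tweet in tweets:
        position = tweet_location(tweet)
        if position not in closest:
            # #1: min() with a key evaluates each distance exactly once
            closest[position] = min(centers,
                                    key=lambda s: geo_distance(position, centers[s]))
        tweets_by_state.setdefault(closest[position], []).append(tweet)
    return tweets_by_state
```

Using min() with a key function replaces the manual running-minimum entirely, which also removes the arbitrary 100000 sentinel and the hard-coded 'CA' default from the original code.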