python-3.x twitter networkx

Twitter hashtags network with networkx


First of all, I want to apologize because I'm a newbie with Twitter data analysis.

I want to make a user-hashtag network where I connect users depending on the hashtags in their tweets. I already have the tweets stored in MongoDB, but I couldn't extract all the hashtags from the entities object, and to be honest I'm kind of lost on how to do it. Could you suggest the best way to achieve this?

I have tried storing the hashtags in a new column in the dataframe but I could only retrieve one, which doesn't work because I need to consider all the hashtags in the tweet to make the connections.

I have the following code to retrieve the hashtags in the second dataframe

def get_tweet_data(df2):
    df2["user_id"] = df1["user"].apply(lambda x: x["id"])
    df2["screen_name"] = df1["user"].apply(lambda x: x["screen_name"])
    df2["hashtags"] = df1["entities"].apply(lambda x: x["hashtags"][0]["text"] if x["hashtags"] else np.nan)
    return df2

which gives me as a result:

Dataframe2

What I'm looking for is something like this:

Expected result in dataframe2

But then I have another problem: I need to connect every tweeting user according to their hashtags, so that the user above would have connections with users who used #Puertos, users who used #Pemex and users who used #abierto. I don't know how to do that.

To make the graph I'm using the following code:

G = nx.from_pandas_edgelist(
    df2,
    source="screen_name",
    target="hashtags",
    create_using=nx.Graph(),
)

Again my apologies, I'm just starting with this.


Solution

  • Let's take it one step at a time. First you would like to extract the hashtags from each tweet. I like the second answer to this question for this task. In your context this would mean running something like:

    df['hashtags']=df['text'].map(lambda s: [i for i in s.split() if i.startswith("#") ])

    This will add a column in which each entry is a list of hashtags.
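Since the tweets in the question already carry an `entities` object, the same list-valued column could also be built from it directly instead of parsing the text. The following is only a minimal sketch: the toy `df1` is hypothetical stand-in data, `all_hashtags` is a helper name made up for illustration, and it assumes each entry of `entities["hashtags"]` is a dict with a `"text"` key, as in the question's code.

```python
import pandas as pd

# Hypothetical stand-in for the tweets loaded from MongoDB.
df1 = pd.DataFrame({
    "user": [{"id": 1, "screen_name": "alice"}, {"id": 2, "screen_name": "bob"}],
    "entities": [
        {"hashtags": [{"text": "Puertos"}, {"text": "Pemex"}]},
        {"hashtags": []},
    ],
})

def all_hashtags(entities):
    """Return the text of every hashtag in a tweet's entities dict (hypothetical helper)."""
    return [h["text"] for h in entities.get("hashtags", [])]

df2 = pd.DataFrame({
    "user_id": df1["user"].apply(lambda u: u["id"]),
    "screen_name": df1["user"].apply(lambda u: u["screen_name"]),
    # Keep every hashtag, not just the first one; tweets without hashtags get [].
    "hashtags": df1["entities"].apply(all_hashtags),
})
```

Using an empty list rather than `np.nan` for tweets without hashtags keeps the column uniform, so later loops over the hashtags need no special case.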

    The second step is a bit more involved. I would first create a bipartite network of users and hashtags. Edges would connect users with the hashtags they use. Then you can use NetworkX's bipartite projection functions to create a network of users with edges indicating shared hashtag use. Here is a sketch of how that might work:

    import networkx as nx
    from networkx.algorithms import bipartite

    B = nx.Graph()  # empty graph that will hold the bipartite user-hashtag network
    # loop over the tweets directly, so users with several tweets keep all their hashtags
    for user, hashtags in zip(df['user_id'], df['hashtags']):
        for hashtag in hashtags:  # for each tweet, loop over the hashtags it uses
            B.add_edge(user, hashtag)  # add the edge user <-> hashtag
    # users actually appearing in the network; users who never used a hashtag are ignored
    actual_users_with_hashtags = [x for x in set(df['user_id']) if x in B]
    # project the bipartite network onto the users
    G = bipartite.weighted_projected_graph(B, nodes=actual_users_with_hashtags)
    

    G should be the network you are interested in, including weights on the edges between users counting the number of hashtags they use in common.
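To make the projection concrete, here is a small self-contained sketch on made-up data. It is not the asker's dataset: the user names and hashtags are hypothetical, and it uses `screen_name` as the user node and `DataFrame.explode` plus `nx.from_pandas_edgelist` to build the bipartite graph, which is one possible alternative to the explicit loop.

```python
import networkx as nx
import pandas as pd
from networkx.algorithms import bipartite

# Hypothetical toy data: one row per tweet, hashtags already extracted as lists.
df = pd.DataFrame({
    "screen_name": ["alice", "bob", "carol", "dave"],
    "hashtags": [["Puertos", "Pemex"], ["Pemex", "abierto"], ["Puertos", "Pemex"], []],
})

# One row per (user, hashtag) pair; tweets without hashtags become NaN and are dropped.
edges = (
    df.explode("hashtags")
      .dropna(subset=["hashtags"])
      .reset_index(drop=True)
)

# Prefix hashtags with "#" so a hashtag node can never collide with a screen name.
edges = edges.assign(hashtag_node="#" + edges["hashtags"])

# Bipartite graph: users on one side, hashtags on the other.
B = nx.from_pandas_edgelist(edges, source="screen_name", target="hashtag_node")

# Project onto the users; edge weight = number of hashtags two users share.
users = set(edges["screen_name"])
G = bipartite.weighted_projected_graph(B, nodes=users)

shared = {frozenset((u, v)): d["weight"] for u, v, d in G.edges(data=True)}
```

In this toy example `alice` and `carol` share two hashtags, so their edge has weight 2, while `dave`, who never used a hashtag, does not appear in `G` at all.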