python pandas twitter multi-index

convert single index pandas data frame to multi-index


I have a data frame with the following structure:

df.columns
Index(['first_post_date', 'followers_count', 'friends_count',
       'last_post_date','min_retweet', 'retweet_count', 'screen_name',
       'tweet_count',  'tweet_with_max_retweet', 'tweets', 'uid'],
        dtype='object')

Each cell of the tweets column is itself a data frame containing all the tweets of one user.

df.tweets[0].columns
Index(['created_at', 'id', 'retweet_count', 'text'], dtype='object')

I want to convert this data frame to a multi-index frame, essentially by expanding the cells that contain tweets. One index level will be the uid, and the other will be the id of each tweet.

How can I do that?

link to sample data


Solution

  • Your df has a tweets column whose cells are data frames of tweets. I create a tweets_df data frame, concatenate every data frame from tweets into it, and add a uid column so each tweet records which user it belongs to. I then merge the per-uid info into tweets_df for further processing if needed. It was hard to get your sample data and convert it to JSON, so I wrote this from a guess at its structure; I hope it still gives you some ideas. Please comment if you need further changes.

    import pandas as pd
    
    df = ...  # your df
    
    tweets_df = pd.DataFrame()  # empty df that will collect every user's tweets
    
    # explode the nested tweets frames into tweets_df
    ## loop over each uid
    for uid in df['uid']:
        temp = df.loc[df['uid'] == uid, :]  # select the row(s) for this uid
        temp = temp['tweets'].iloc[0].copy()  # nested tweets df (copied so the original is not modified)
        temp['uid'] = uid  # add uid column to record which user the tweets belong to
        tweets_df = pd.concat([tweets_df, temp], ignore_index=True)  # append to the container df
    
    # get a uid info df from the starting df (every column except 'tweets')
    uid_info_column = df.columns.drop('tweets')  # Index has no .remove(); .drop() returns a new Index
    uid_info_df = df.loc[:, uid_info_column]
    
    
    # merge info on uid with tweets_df
    # note: 'retweet_count' exists in both frames, so pandas suffixes it as _x (tweet) and _y (user)
    final = pd.merge(left=tweets_df, right=uid_info_df, on='uid', how='outer')
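
The code above produces a flat frame, so one more step is needed to get the (uid, id) multi-index the question asks for. Here is a minimal sketch of one way to do that, assuming each nested frame has an id column as shown in the question. The toy df is made-up stand-in data, and the names tweets_by_user, multi, user_info and with_info are only illustrative. Passing a dict to pd.concat makes its keys the outer index level:

```python
import pandas as pd

# Toy stand-in for the asker's frame (made-up values): one row per user,
# with a nested DataFrame of tweets in the 'tweets' column.
df = pd.DataFrame({
    'uid': [1, 2],
    'screen_name': ['alice', 'bob'],
    'tweets': [
        pd.DataFrame({'id': [10, 11], 'text': ['a', 'b']}),
        pd.DataFrame({'id': [20], 'text': ['c']}),
    ],
})

# Index each nested frame by tweet id; the dict keys (uid) become the outer level.
tweets_by_user = {
    uid: tweets.set_index('id')
    for uid, tweets in zip(df['uid'], df['tweets'])
}
multi = pd.concat(tweets_by_user, names=['uid', 'id'])

# Optionally attach the per-user columns, then restore the two-level index.
user_info = df.drop(columns='tweets')
with_info = (
    multi.reset_index()
         .merge(user_info, on='uid', how='left')
         .set_index(['uid', 'id'])
)
```

With this layout, multi.loc[1] returns all tweets of uid 1, and multi.loc[(2, 20)] returns a single tweet. Alternatively, calling final.set_index(['uid', 'id']) on the merged frame from the loop-based code should give the same kind of index.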