pandas retaining column after flattening json

I have a pandas dataframe where each row corresponds to a news article, and each cell in the row contains one dictionary that contains stock ticker information as well as some metrics. The dataframe also contains a time_published column, which contains the publishing date and time corresponding to the row for the article.

I flattened the table into a more workable format, but due to the way I did it, I'm now missing the time_published data in my final table. I'm not quite sure how to make it so that I can keep it.

This is the original table:

index	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	31	32	33	34	35	36	37	38	39	40	41	time_published
0	{'ticker': 'RYCEF', 'relevance_score': '0.079921', 'ticker_sentiment_score': '0.091132', 'ticker_sentiment_label': 'Neutral'}			null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	20240211T042031
1				null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	20240211T034349
2	{'ticker': 'FOREX:IDR', 'relevance_score': '0.047371', 'ticker_sentiment_score': '0.0', 'ticker_sentiment_label': 'Neutral'}	{'ticker': 'FOREX:JPY', 'relevance_score': '0.094576', 'ticker_sentiment_score': '0.079538', 'ticker_sentiment_label': 'Neutral'}		null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	20240211T033015
3				null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	20240211T031049
4	{'ticker': 'BLK', 'relevance_score': '0.080188', 'ticker_sentiment_score': '-0.036152', 'ticker_sentiment_label': 'Neutral'}	{'ticker': 'GOOG', 'relevance_score': '0.080188', 'ticker_sentiment_score': '0.0', 'ticker_sentiment_label': 'Neutral'}	{'ticker': 'CRYPTO:BTC', 'relevance_score': '0.237358', 'ticker_sentiment_score': '0.229927', 'ticker_sentiment_label': 'Somewhat-Bullish'}	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	20240211T023653

I used the following code chunk to flatten the table into a workable format:

df_final = pd.DataFrame() 
for i in range(0, 41):
    df_flat_ticker_temp = pd.json_normalize(df_flat_ticker[i])
    df_final = pd.concat([df_final, df_flat_ticker_temp])
    df_final = df_final.dropna()

Resulting in this table:

index	ticker	relevance_score	ticker_sentiment_score	ticker_sentiment_label
0	RYCEF	0.079921	0.091132	Neutral
2	FOREX:IDR	0.047371	0.0	Neutral
4	BLK	0.080188	-0.036152	Neutral
5	NSC	0.329551	-0.319278	Somewhat-Bearish
6	FOREX:EUR	0.043499	-0.222413	Somewhat-Bearish
7	FOREX:EUR	0.259727	0.11301	Neutral
8	VZ	0.390613	0.331239	Somewhat-Bullish
9	BABA	0.064449	0.0	Neutral
10	ADM	0.376704	0.166407	Somewhat-Bullish
11	EVGGF	0.098255	-0.010632	Neutral

I am happy with this table, but the only problem is that I cannot find a way to preserve the time_published data due to the way the table was originally formatted. My difficulty with this task arises from the fact that in the original table, an entry in the time_published column corresponds with all of the dictionaries in that row, but in my final table each of those dictionaries is now a separate row. I'd appreciate any help or advice from someone more skilled at pandas than I am on how to do this.

Solution

You could add the time_published column back in to each set of columns you append to df_final:

for i in range(42):
    df_flat_ticker_temp = pd.json_normalize(df_flat_ticker[i])
    df_final = pd.concat([df_final, pd.concat([df_flat_ticker_temp, df['time_published']], axis=1)])

df_final = df_final.dropna()

Output for your sample data:

       ticker relevance_score ticker_sentiment_score ticker_sentiment_label   time_published
0       RYCEF        0.079921               0.091132                Neutral  20240211T042031
2   FOREX:IDR        0.047371                    0.0                Neutral  20240211T033015
4         BLK        0.080188              -0.036152                Neutral  20240211T023653
2   FOREX:JPY        0.094576               0.079538                Neutral  20240211T033015
4        GOOG        0.080188                    0.0                Neutral  20240211T023653
4  CRYPTO:BTC        0.237358               0.229927       Somewhat-Bullish  20240211T023653

Note that you need range(42) to cover all your columns (up to 41).

Also you could use assign instead of a nested concat:

df_final = pd.concat([df_final, df_flat_ticker_temp.assign(time_published=df['time_published'])])