I have a pandas dataframe where each row corresponds to a news article, and each cell in the row contains one dictionary that contains stock ticker information as well as some metrics. The dataframe also contains a time_published column, which contains the publishing date and time corresponding to the row for the article.
I flattened the table into a more workable format, but due to the way I did it, I'm now missing the time_published data in my final table. I'm not quite sure how to make it so that I can keep it.
This is the original table:
index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | time_published |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | {'ticker': 'RYCEF', 'relevance_score': '0.079921', 'ticker_sentiment_score': '0.091132', 'ticker_sentiment_label': 'Neutral'} | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | 20240211T042031 | ||
1 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | 20240211T034349 | |||
2 | {'ticker': 'FOREX:IDR', 'relevance_score': '0.047371', 'ticker_sentiment_score': '0.0', 'ticker_sentiment_label': 'Neutral'} | {'ticker': 'FOREX:JPY', 'relevance_score': '0.094576', 'ticker_sentiment_score': '0.079538', 'ticker_sentiment_label': 'Neutral'} | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | 20240211T033015 | |
3 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | 20240211T031049 | |||
4 | {'ticker': 'BLK', 'relevance_score': '0.080188', 'ticker_sentiment_score': '-0.036152', 'ticker_sentiment_label': 'Neutral'} | {'ticker': 'GOOG', 'relevance_score': '0.080188', 'ticker_sentiment_score': '0.0', 'ticker_sentiment_label': 'Neutral'} | {'ticker': 'CRYPTO:BTC', 'relevance_score': '0.237358', 'ticker_sentiment_score': '0.229927', 'ticker_sentiment_label': 'Somewhat-Bullish'} | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | 20240211T023653 |
I used the following code chunk to flatten the table into a workable format:
df_final = pd.DataFrame()
for i in range(0, 41):
df_flat_ticker_temp = pd.json_normalize(df_flat_ticker[i])
df_final = pd.concat([df_final, df_flat_ticker_temp])
df_final = df_final.dropna()
Resulting in this table:
index | ticker | relevance_score | ticker_sentiment_score | ticker_sentiment_label |
---|---|---|---|---|
0 | RYCEF | 0.079921 | 0.091132 | Neutral |
2 | FOREX:IDR | 0.047371 | 0.0 | Neutral |
4 | BLK | 0.080188 | -0.036152 | Neutral |
5 | NSC | 0.329551 | -0.319278 | Somewhat-Bearish |
6 | FOREX:EUR | 0.043499 | -0.222413 | Somewhat-Bearish |
7 | FOREX:EUR | 0.259727 | 0.11301 | Neutral |
8 | VZ | 0.390613 | 0.331239 | Somewhat-Bullish |
9 | BABA | 0.064449 | 0.0 | Neutral |
10 | ADM | 0.376704 | 0.166407 | Somewhat-Bullish |
11 | EVGGF | 0.098255 | -0.010632 | Neutral |
I am happy with this table, but the only problem is that I cannot find a way to preserve the time_published data due to the way the table was originally formatted. My difficulty with this task arises from the fact that in the original table, an entry in the time_published column corresponds with all of the dictionaries in that row, but in my final table each of those dictionaries is now a separate row. I'd appreciate any help or advice from someone more skilled at pandas than I am on how to do this.
You could add the time_published
column back in to each set of columns you append to df_final
:
for i in range(42):
df_flat_ticker_temp = pd.json_normalize(df_flat_ticker[i])
df_final = pd.concat([df_final, pd.concat([df_flat_ticker_temp, df['time_published']], axis=1)])
df_final = df_final.dropna()
Output for your sample data:
ticker relevance_score ticker_sentiment_score ticker_sentiment_label time_published
0 RYCEF 0.079921 0.091132 Neutral 20240211T042031
2 FOREX:IDR 0.047371 0.0 Neutral 20240211T033015
4 BLK 0.080188 -0.036152 Neutral 20240211T023653
2 FOREX:JPY 0.094576 0.079538 Neutral 20240211T033015
4 GOOG 0.080188 0.0 Neutral 20240211T023653
4 CRYPTO:BTC 0.237358 0.229927 Somewhat-Bullish 20240211T023653
Note that you need range(42)
to cover all your columns (up to 41).
Also you could use assign
instead of a nested concat
:
df_final = pd.concat([df_final, df_flat_ticker_temp.assign(time_published=df['time_published'])])