I'm trying to explode a dataframe column into multiple rows. The column is called keywords and holds the list of emotions that the FlashText package returns for each row: if a keyword appears in the text column (the column with the sentences), FlashText returns the emotion, or emotions, corresponding to that sentence.
With an example dataframe that I created myself, this works and gives the expected output. However, when I apply it to my real dataframe, explode returns a seemingly random combination of rows.
I thought these unexpected results were caused by duplicate indexes in the dataframe, but dropping them gives the same wrong result.
import pandas as pd
from flashtext import KeywordProcessor

# keywords_dict maps each emotion to its list of keywords (defined elsewhere)
kp = KeywordProcessor()
kp.add_keywords_from_dict(keyword_dict=keywords_dict)
test_df = pd.DataFrame({
    'text': [
        'I really hate and love love everyone best confident shy',
        'i should be sleeping i have a stressed out week coming to me',
        'late night snack glass of oj bc im quotdown with the sicknessquot then back to sleepugh i hate getting sick',
        # NaN results to empty list
        'whatever',
        '[]',
        'body of missing northern calif girl found poli',
        'i miss kenny powers',
        'sorry tell them mea culpa from me and that i really am sorry',
    ]
})
# Extracting keywords
test_df['keywords'] = test_df['text'].apply(lambda x: kp.extract_keywords(x, span_info=False))
# Exploding keywords column into rows
test_df = test_df.explode('keywords').reset_index(drop=True)  # drop duplicate indexes
# Transforming NaN into empty list
test_df['keywords'] = test_df['keywords'].fillna({i: [] for i in test_df.index})
test_df
    text                                               keywords
0   I really hate and love love everyone best conf...  unfriendly
1   I really hate and love love everyone best conf...  friendly
2   I really hate and love love everyone best conf...  friendly
3   I really hate and love love everyone best conf...  confident
4   I really hate and love love everyone best conf...  insecure
5   i should be sleeping i have a stressed out wee...  neg_hp
6   late night snack glass of oj bc im quotdown wi...  unfriendly
7   whatever                                           []
8   []                                                 []
9   body of missing northern calif girl found poli     []
10  i miss kenny powers                                []
11  sorry tell them mea culpa from me and that i ...   sadness
12  sorry tell them mea culpa from me and that i ...   sadness
With the test dataframe, the sentence "i miss kenny powers" returns an empty list, as expected.
With the real dataframe, the same sentence "i miss kenny powers" returns the emotion confident, which is wrong.
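One thing that may be worth checking is whether the index is really unique at the moment explode is called. The sketch below assumes the real dataframe was built by concatenating several frames (the name df_concat suggests this, but that is an assumption); part_a and part_b are hypothetical stand-ins, not the real data.

```python
import pandas as pd

# Hypothetical stand-ins for the frames that were concatenated.
part_a = pd.DataFrame({"text": ["i miss kenny powers", "whatever"]})
part_b = pd.DataFrame({"text": ["i am confident", "sorry"]})

# pd.concat keeps each frame's own labels, so 0 and 1 appear twice.
combined = pd.concat([part_a, part_b])

# True when at least one index label is repeated.
has_duplicates = not combined.index.is_unique

# The labels that occur more than once.
duplicated_labels = combined.index[combined.index.duplicated()].unique().tolist()

# Renumbering the rows before any explode removes the repeated labels.
clean = combined.reset_index(drop=True)

print(has_duplicates, duplicated_labels, clean.index.is_unique)
```

If has_duplicates is True on the real data, resetting the index before calling explode (rather than only afterwards) would be the first thing to try.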
Dataframe: dataframe sample 40k
Current workaround that works for me, using the csv module:
# New solution: exploding with csv
import ast
import csv

CSV_PATH = 'temp_data.csv'
data = []
df_concat.to_csv(CSV_PATH)
with open(file=CSV_PATH, mode='r') as f:
    reader = csv.DictReader(f)
    columns = reader.fieldnames
    print(columns)
    for record in reader:
        # The list was written as its string repr; parse it back safely instead of using eval
        keywords = ast.literal_eval(record['keywords'])
        if not keywords:
            # Keep rows without keywords, marked with the placeholder '[]'
            data.append((record['text'], '[]'))  # record['category'], record['Valence'], record['Arousal'], record['Dominance']
        for keyword in keywords:
            data.append((record['text'], keyword))  # record['category'], record['Valence'], record['Arousal'], record['Dominance']
df_concat = pd.DataFrame(data, columns=['text', 'keywords'])
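For comparison, here is a minimal pandas-only sketch of the same result without the CSV round trip. It assumes the keywords column already holds Python lists and that the scrambled rows come from a non-unique index at explode time, which is a guess rather than a confirmed diagnosis. The helper name explode_keywords and the sample frame are hypothetical.

```python
import pandas as pd


def explode_keywords(df, column="keywords"):
    """Return one row per keyword, keeping rows that have no keywords."""
    # Make the index unique BEFORE exploding (assumed cause of the mix-up).
    out = df.reset_index(drop=True)
    # Replace empty lists with the placeholder '[]', as the csv loop does.
    out[column] = out[column].apply(lambda kws: list(kws) if len(kws) > 0 else ["[]"])
    # Explode, then renumber the resulting rows.
    return out.explode(column).reset_index(drop=True)


# Hypothetical sample with a repeated index label (0 appears twice).
sample = pd.DataFrame(
    {
        "text": ["i hate and love", "i miss kenny powers", "sorry"],
        "keywords": [["unfriendly", "friendly"], [], ["sadness"]],
    },
    index=[0, 0, 1],
)

result = explode_keywords(sample)
print(result)
```

The placeholder string '[]' mirrors the csv workaround; returning NaN for empty rows instead would only require dropping the apply step.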