
pandas Explode producing unexpected results


I'm trying to explode a column of a dataframe to get multiple rows. The column to explode is called keywords; it holds a list of emotions returned as keywords by the FlashText package. If a keyword appears in the text column (the column with sentences), FlashText returns the emotion, or several emotions, corresponding to that sentence.

With an example dataframe that I created myself, this works perfectly and gives the expected output. However, when I apply explode to the real dataframe, it returns what looks like a random combination of rows.

I thought these unexpected results were caused by duplicate indexes in the dataframe; however, dropping them gave the same wrong result.
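One thing worth checking here is whether the index is made unique before explode runs rather than after it. The snippet below is only a minimal sketch: `part_a` and `part_b` are hypothetical frames, and it assumes the real dataframe was built with `pd.concat`, which keeps each frame's original labels.

```python
import pandas as pd

# Hypothetical frames standing in for the pieces that were concatenated.
part_a = pd.DataFrame({"text": ["i hate mondays", "whatever"],
                       "keywords": [["unfriendly"], []]})
part_b = pd.DataFrame({"text": ["love and hate", "i miss kenny powers"],
                       "keywords": [["friendly", "unfriendly"], []]})

# pd.concat keeps each frame's own labels, so the index here is 0, 1, 0, 1.
df_concat = pd.concat([part_a, part_b])
has_duplicate_labels = not df_concat.index.is_unique

# Make the index unique *before* exploding, then renumber the exploded rows.
df_unique = df_concat.reset_index(drop=True)
exploded = df_unique.explode("keywords").reset_index(drop=True)
```

If `has_duplicate_labels` is true on the real data, resetting the index before the call to explode (not only after it) is a cheap thing to try; whether it fixes the mismatch may depend on the pandas version in use.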

Expected output

import pandas as pd
from flashtext import KeywordProcessor

# keywords_dict maps each emotion to its trigger words (defined elsewhere)
kp = KeywordProcessor()
kp.add_keywords_from_dict(keyword_dict=keywords_dict)


test_df = pd.DataFrame({'text': ['I really hate and love love everyone best confident shy', 'i should be sleeping i have a stressed out week coming to me',
                                 'late night snack glass of oj bc im quotdown with the sicknessquot then back to sleepugh i hate getting sick', 
                                 
                                 # NaN results to empty list
                                 'whatever', 
                                 '[]', 
                                 'body of missing northern calif girl found poli', 
                                 'i miss kenny powers',

                                 'sorry  tell them mea culpa from me and that i really am sorry'
                        ]
                        })

# Extracting keywords
test_df['keywords'] = test_df['text'].apply(lambda x: kp.extract_keywords(x, span_info=False))

# Exploding keywords column into rows
# reset_index(drop=True) renumbers the rows so the exploded frame has no duplicate indexes
test_df = test_df.explode('keywords').reset_index(drop=True)

# Transforming NaN into empty list
test_df['keywords'] = test_df['keywords'].fillna({i: [] for i in test_df.index})


test_df
    text                                                keywords
0   I really hate and love love everyone best conf...   unfriendly
1   I really hate and love love everyone best conf...   friendly
2   I really hate and love love everyone best conf...   friendly
3   I really hate and love love everyone best conf...   confident
4   I really hate and love love everyone best conf...   insecure
5   i should be sleeping i have a stressed out wee...   neg_hp
6   late night snack glass of oj bc im quotdown wi...   unfriendly
7   whatever                                            []
8   []                                                  []
9   body of missing northern calif girl found poli      []
10  i miss kenny powers                                 []
11  sorry tell them mea culpa from me and that i ...    sadness
12  sorry tell them mea culpa from me and that i ...    sadness

Current output without explode

Here the sentence "i miss kenny powers" returns an empty list.


Current output with explode

Here the sentence "i miss kenny powers" returns the emotion confident, which is wrong.


Dataframe: dataframe sample 40k


Solution

  • Current solution that works for me, using the csv package:

    # New solution: exploding via a CSV round trip
    import ast
    import csv

    import pandas as pd

    CSV_PATH = 'temp_data.csv'
    data = []

    df_concat.to_csv(CSV_PATH)

    with open(CSV_PATH, mode='r', newline='') as f:
        reader = csv.DictReader(f)
        print(reader.fieldnames)

        for record in reader:
            # The list was written as its string repr; parse it back safely
            keywords = ast.literal_eval(record['keywords'])

            # Keep rows with no keywords, marked with the string '[]'
            # (category, Valence, Arousal, Dominance could be appended the same way)
            if not keywords:
                data.append((record['text'], '[]'))

            # One output row per extracted keyword
            for keyword in keywords:
                data.append((record['text'], keyword))

    df_concat = pd.DataFrame(data, columns=['text', 'keywords'])
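The same row-by-row expansion can probably be done in memory, without writing a temporary file or parsing the lists back from strings. This is a sketch only: the small `df_concat` below is made-up sample data with a deliberately duplicated index, and it assumes the keywords column still holds real Python lists.

```python
import pandas as pd

# Made-up sample standing in for the real df_concat; the index is duplicated on purpose.
df_concat = pd.DataFrame(
    {"text": ["love and hate", "i miss kenny powers"],
     "keywords": [["friendly", "unfriendly"], []]},
    index=[0, 0],
)

# One (text, keyword) pair per keyword; an empty list becomes a single '[]' marker row,
# mirroring what the CSV version does.
rows = [
    (text, keyword)
    for text, keywords in zip(df_concat["text"], df_concat["keywords"])
    for keyword in (keywords or ["[]"])
]

result = pd.DataFrame(rows, columns=["text", "keywords"])
```

Because the rows are paired up by position with `zip`, the index labels never come into play, so duplicate indexes cannot mix up texts and keywords.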