I have a DataFrame with two columns: one holds the source text, the other holds a list of (keyword, score)
pairs generated by KeyBERT from that text, for example:
[('white butterfly', 0.4587),
('pest horseradish', 0.4974),
('mature caterpillars', 0.6484)]
I want to split that column in two, one column holding all the keywords and the other holding the scores, like this:
df2data = [["The word horseradish is attested in English from the 1590s. It combines the word horse (formerly used in a figurative sense to mean strong or coarse) and the word radish. Some sources claim that the term originates from a mispronunciation of the German word meerrettich as mareradish. However, this hypothesis has been disputed, as there is no historical evidence of this term being used. In Central and Eastern Europe, horseradish is called chren, hren and ren (in various spellings like kren) in many Slavic languages, in Austria, in parts of Germany (where the other German name Meerrettich is not used), in North-East Italy, and in Yiddish (כריין transliterated as khreyn). It is common in Ukraine (under the name of хрін, khrin), in Belarus (under the name of хрэн, chren), in Poland (under the name of chrzan), in the Czech Republic (křen), in Slovakia (chren), in Russia (хрен, khren), in Hungary (torma), in Romania (hrean), in Lithuania (krienas), and in Bulgaria (under the name of хрян)",
['mispronunciation german', 'mareradish hypothesis', 'horse used'],
[0.3715, 0.422, 0.4594]],
["Widely introduced by accident, cabbageworms, the larvae of Pieris rapae, the small white butterfly, are a common caterpillar pest in horseradish. The adults are white butterflies with black spots on the forewings that are commonly seen flying around plants during the day. The caterpillars are velvety green with faint yellow stripes running lengthwise down the back and sides. Fully grown caterpillars are about 25-millimetre (1 in) in length. They move sluggishly when prodded. They overwinter in green pupal cases. Adults start appearing in gardens after the last frost and are a problem through the remainder of the growing season. There are three to five overlapping generations a year. Mature caterpillars chew large, ragged holes in the leaves leaving the large veins intact. Handpicking is an effective control strategy in home gardens",
['white butterfly', 'pest horseradish', 'mature caterpillars'],
[0.4587, 0.4974, 0.6484]]]
df2 = pd.DataFrame(data = df2data, columns=['texts', 'words', 'scores'])
words | scores | texts |
---|---|---|
[mispronunciation german, mareradish hypothesi...] | [0.3715, 0.422, 0.4594] | The word horseradish is attested in English from the 1590s... |
[white butterfly, pest horseradish, mature cat...] | [0.4587, 0.4974, 0.6484] | Widely introduced by accident, cabbageworms, the larvae of Pieris rapae, the... |
I have tried indexing the words column, but I only manage to slice within the Series itself.
Code to generate a reproducible example:
import pandas as pd
from keybert import KeyBERT
kw_model = KeyBERT()
text = ["The word horseradish is attested in English from the 1590s. It combines the word horse (formerly used in a figurative sense to mean strong or coarse) and the word radish. Some sources claim that the term originates from a mispronunciation of the German word meerrettich as mareradish. However, this hypothesis has been disputed, as there is no historical evidence of this term being used. In Central and Eastern Europe, horseradish is called chren, hren and ren (in various spellings like kren) in many Slavic languages, in Austria, in parts of Germany (where the other German name Meerrettich is not used), in North-East Italy, and in Yiddish (כריין transliterated as khreyn). It is common in Ukraine (under the name of хрін, khrin), in Belarus (under the name of хрэн, chren), in Poland (under the name of chrzan), in the Czech Republic (křen), in Slovakia (chren), in Russia (хрен, khren), in Hungary (torma), in Romania (hrean), in Lithuania (krienas), and in Bulgaria (under the name of хрян)","Widely introduced by accident, cabbageworms, the larvae of Pieris rapae, the small white butterfly, are a common caterpillar pest in horseradish. The adults are white butterflies with black spots on the forewings that are commonly seen flying around plants during the day. The caterpillars are velvety green with faint yellow stripes running lengthwise down the back and sides. Fully grown caterpillars are about 25-millimetre (1 in) in length. They move sluggishly when prodded. They overwinter in green pupal cases. Adults start appearing in gardens after the last frost and are a problem through the remainder of the growing season. There are three to five overlapping generations a year. Mature caterpillars chew large, ragged holes in the leaves leaving the large veins intact. Handpicking is an effective control strategy in home gardens"]
df = pd.DataFrame(text, columns=['texts'])
df['words'] = kw_model.extract_keywords(df['texts'], keyphrase_ngram_range=(1, 2), stop_words='english',
use_maxsum=True, nr_candidates=20, top_n=3)
One way is to transpose the inner sequences (for example with zip) and apply pandas.Series to them:
df = pd.DataFrame({
'words': [
[('mispronunciation german', 0.3715),
('mareradish hypothesis', 0.422),
('horse used', 0.4594)],
[('white butterfly', 0.4587),
('pest horseradish', 0.4974),
('mature caterpillars', 0.6484)],
[('test phrase', 0.1234)],
]
})
splitted_values = df['words'].apply(lambda x: pd.Series([*zip(*x)], ['words','scores']))
print(splitted_values.to_string())
Output:
words scores
0 (mispronunciation german, mareradish hypothesis, horse used) (0.3715, 0.422, 0.4594)
1 (white butterfly, pest horseradish, mature caterpillars) (0.4587, 0.4974, 0.6484)
2 (test phrase,) (0.1234,)
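If lists rather than tuples are wanted, and the two columns should end up on the original frame, the same zip idea could be wrapped in a small helper. This is only a minimal sketch: split_pairs is a hypothetical helper name, the sample values are shortened for illustration, and it assumes every row contains at least one pair (an empty list would fail at the unpacking step).

```python
import pandas as pd

# Small sample in the same shape as the KeyBERT output (illustrative values).
df = pd.DataFrame({
    'words': [
        [('white butterfly', 0.4587), ('pest horseradish', 0.4974)],
        [('test phrase', 0.1234)],
    ]
})

def split_pairs(pairs):
    # split_pairs is a hypothetical helper, not part of pandas or KeyBERT.
    # zip(*pairs) transposes [(w1, s1), (w2, s2)] into (w1, w2) and (s1, s2).
    words, scores = zip(*pairs)
    return pd.Series([list(words), list(scores)], index=['words', 'scores'])

# Overwrite 'words' and add 'scores' in a single assignment.
df[['words', 'scores']] = df['words'].apply(split_pairs)
```

After this, df['words'] holds lists of keywords and df['scores'] holds the matching lists of floats, row by row.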
Another one: explode the pairs, split each pair into separate columns, group by index and aggregate with list:
df[['words', 'scores']] = (
df['words']
.explode()
.apply(pd.Series, index=['words', 'scores'])
.groupby(level=0)
.agg(list)
)
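For comparison, the same split can probably be done without apply or explode, using plain list comprehensions. This is a sketch on shortened sample data rather than a definitive recommendation; the empty list in the second row is an assumption added to show that rows with no keywords pass through unchanged.

```python
import pandas as pd

df = pd.DataFrame({
    'words': [
        [('white butterfly', 0.4587), ('pest horseradish', 0.4974)],
        [],  # assumed case: a text for which no keywords were returned
    ]
})

# Build 'scores' first, because the next line overwrites 'words'.
df['scores'] = [[score for _, score in pairs] for pairs in df['words']]
df['words'] = [[word for word, _ in pairs] for pairs in df['words']]
```

The order of the two assignments matters: once 'words' has been replaced by plain keyword lists, the scores can no longer be recovered from it.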