I am working with Hugging Face transformers and am using the facebook/bart-large-cnn model to perform text summarisation for my project. I am currently using the following code to run some tests:
text = """
Justin Timberlake and Jessica Biel, welcome to parenthood.
The celebrity couple announced the arrival of their son, Silas Randall Timberlake, in
statements to People."""
from transformers import pipeline
smr_bart = pipeline(task="summarization", model="facebook/bart-large-cnn")
smbart = smr_bart(text, max_length=150)
print(smbart[0]['summary_text'])
This small piece of code gives me a very good summary of the text. My question is: how can I apply the same pre-trained model to a column of my dataframe? My dataframe looks like this:
ID Lang Text
1 EN some long text here...
2 EN some long text here...
3 EN some long text here...
.... and so on for 50K rows
Now I want to apply the pre-trained model to the Text column to generate a new column, df['summary'], so that the resulting dataframe looks like this:
ID Lang Text Summary
1 EN some long text here... Text summary goes here...
2 EN some long text here... Text summary goes here...
3 EN some long text here... Text summary goes here...
How can I achieve this? Any help would be much appreciated.
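One thing you can always do is use the dataframe's apply function:
import pandas as pd
# Build a small test dataframe from the sample text in the question
df = pd.DataFrame([('EN',text)]*10, columns=['Lang','Text'])
# Call the pipeline once per row and keep only the summary string
df['summary'] = df.apply(lambda x: smr_bart(x['Text'], max_length=150)[0]['summary_text'], axis=1)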
df.head(3)
Output:
Lang Text summary
0 EN \nJustin Timberlake and Jessica Biel, welcome ... The celebrity couple announced the arrival of ...
1 EN \nJustin Timberlake and Jessica Biel, welcome ... The celebrity couple announced the arrival of ...
2 EN \nJustin Timberlake and Jessica Biel, welcome ... The celebrity couple announced the arrival of ...
That is a bit inefficient because the pipeline is called once for every row (execution time: 2 minutes and 16 seconds). I therefore recommend converting the Text column to a list and passing it to the pipeline directly (execution time: 41 seconds):
df = pd.DataFrame([('EN',text)]*10, columns=['Lang','Text'])
df['summary'] = [x['summary_text'] for x in smr_bart(df['Text'].tolist(), max_length=150)]
df.head(3)
Output:
Lang Text summary
0 EN \nJustin Timberlake and Jessica Biel, welcome ... The celebrity couple announced the arrival of ...
1 EN \nJustin Timberlake and Jessica Biel, welcome ... The celebrity couple announced the arrival of ...
2 EN \nJustin Timberlake and Jessica Biel, welcome ... The celebrity couple announced the arrival of ...
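For the full 50K rows you may not want to hand the whole list over in one call, since a failure late in the run would lose everything. A minimal sketch of feeding the column to the pipeline in chunks follows. The helper name summarize_column, its parameters, and the chunk size of 64 are hypothetical choices for illustration, not part of transformers or pandas, and the sketch assumes the Text column has no missing values.

```python
import pandas as pd


def summarize_column(df, summarizer, text_col="Text", out_col="summary",
                     chunk_size=64, **summarizer_kwargs):
    """Return a copy of df with a summary column built chunk by chunk.

    Hypothetical helper: `summarizer` is any callable that takes a list of
    strings and returns a list of dicts with a 'summary_text' key, which is
    the shape the summarization pipeline in this answer returns.
    """
    texts = df[text_col].tolist()
    summaries = []
    for start in range(0, len(texts), chunk_size):
        chunk = texts[start:start + chunk_size]
        # One call per chunk instead of one call per row
        results = summarizer(chunk, **summarizer_kwargs)
        summaries.extend(r["summary_text"] for r in results)
    out = df.copy()  # leave the caller's dataframe untouched
    out[out_col] = summaries
    return out
```

With the pipeline from the question this would be called as df = summarize_column(df, smr_bart, max_length=150, truncation=True). I am assuming here that your texts can exceed the model's input limit (1024 tokens for BART), in which case truncation=True asks the pipeline to cut them off instead of raising an error. The chunk size is a guess that you should tune to your hardware.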