
Applying the pre-trained facebook/bart-large-cnn model for text summarization in Python


I am working with Hugging Face transformers and have some basic familiarity with the library. I am using the facebook/bart-large-cnn model to perform text summarization for my project, and I am currently testing it with the following code:

text = """
Justin Timberlake and Jessica Biel, welcome to parenthood. 
The celebrity couple announced the arrival of their son, Silas Randall Timberlake, in 
statements to People."""

from transformers import pipeline
smr_bart = pipeline(task="summarization", model="facebook/bart-large-cnn")
smbart = smr_bart(text, max_length=150)
print(smbart[0]['summary_text'])

This small piece of code gives me a very good summary of the text. My question is: how can I apply the same pre-trained model to a column of my dataframe? My dataframe looks like this:

ID        Lang          Text
1         EN            some long text here...
2         EN            some long text here...
3         EN            some long text here...

.... and so on for 50K rows

I want to apply the pre-trained model to the Text column to generate a new column, df['summary'], so that the resulting dataframe looks like this:

ID        Lang         Text                              Summary
1         EN            some long text here...           Text summary goes here...
2         EN            some long text here...           Text summary goes here...
3         EN            some long text here...           Text summary goes here...

How can I achieve this? Any help would be much appreciated.


Solution

  • One thing you can always do is use the dataframe apply function:

    import pandas as pd

    # Build a 10-row example dataframe from the text in the question
    df = pd.DataFrame([('EN',text)]*10, columns=['Lang','Text'])
    
    df['summary'] = df.apply(lambda x: smr_bart(x['Text'], max_length=150)[0]['summary_text'] , axis=1)
    
    df.head(3)
    

    Output:

        Lang    Text                                                summary
    0   EN      \nJustin Timberlake and Jessica Biel, welcome ...   The celebrity couple announced the arrival of ...
    1   EN      \nJustin Timberlake and Jessica Biel, welcome ...   The celebrity couple announced the arrival of ...
    2   EN      \nJustin Timberlake and Jessica Biel, welcome ...   The celebrity couple announced the arrival of ...
    

    That is a bit inefficient because the pipeline is called once for every row (execution time: 2 minutes and 16 seconds on the example above). I therefore recommend converting the Text column to a list and passing it to the pipeline directly (execution time: 41 seconds):

    df = pd.DataFrame([('EN',text)]*10, columns=['Lang','Text'])
    
    df['summary'] = [x['summary_text'] for x in smr_bart(df['Text'].tolist(), max_length=150)]
    
    df.head(3)
    

    Output:

        Lang    Text                                                summary
    0   EN      \nJustin Timberlake and Jessica Biel, welcome ...   The celebrity couple announced the arrival of ...
    1   EN      \nJustin Timberlake and Jessica Biel, welcome ...   The celebrity couple announced the arrival of ...
    2   EN      \nJustin Timberlake and Jessica Biel, welcome ...   The celebrity couple announced the arrival of ...
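
For the full 50K rows, passing every text in one call gives no intermediate results if the run fails partway through. One possible way around that is to feed the list to the pipeline in fixed-size chunks. The following is only a minimal sketch: it assumes nothing beyond the behaviour shown above (a list of texts goes in, a list of dicts with a 'summary_text' key comes out), and the names iter_chunks, summarize_in_chunks and chunk_size are hypothetical helpers made up for illustration, not part of transformers.

```python
from typing import Callable, Dict, Iterator, List


def iter_chunks(items: List[str], chunk_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` holding at most `chunk_size` elements."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def summarize_in_chunks(
    texts: List[str],
    summarizer: Callable[..., List[Dict[str, str]]],
    chunk_size: int = 64,
    **summarizer_kwargs,
) -> List[str]:
    """Summarize `texts` chunk by chunk and return the summaries in input order.

    `summarizer` is assumed to behave like the `smr_bart` pipeline above:
    called with a list of strings, it returns one dict per input string,
    each containing a 'summary_text' key.
    """
    summaries: List[str] = []
    for chunk in iter_chunks(texts, chunk_size):
        # Extra keyword arguments (e.g. max_length=150) are passed through unchanged.
        results = summarizer(chunk, **summarizer_kwargs)
        summaries.extend(result["summary_text"] for result in results)
    return summaries
```

With the objects from the answer above, this could be called as df['summary'] = summarize_in_chunks(df['Text'].tolist(), smr_bart, chunk_size=64, max_length=150). The chunk size of 64 is an arbitrary starting point, not a tuned value.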