Tags: pytorch, nlp, huggingface-transformers, huggingface-tokenizers, huggingface

Avoiding Trimmed Summaries of a PEGASUS-Pubmed huggingface summarization model


I am new to Hugging Face. I am using the PEGASUS-PubMed model (google/pegasus-pubmed) to generate summaries of research papers. The code is below. The model returns a trimmed summary. Is there a way to avoid the trimmed summaries and get complete results?

Following is the code that I tried.

# Load the PubMed subset of the scientific_papers dataset
from datasets import load_dataset

dataset_pubmed = load_dataset("scientific_papers", "pubmed")

# Take the train split
sample_dataset = dataset_pubmed["train"]
sample_dataset

# Take the first two articles of the train split (slice rows first, then pick the column)
sample_dataset = sample_dataset[:2]["article"]
sample_dataset

# Import the pipeline helper, the Pegasus model, and its tokenizer
from transformers import pipeline, PegasusTokenizer, PegasusForConditionalGeneration

model = PegasusForConditionalGeneration.from_pretrained('google/pegasus-pubmed')
tokenizer = PegasusTokenizer.from_pretrained('google/pegasus-pubmed')

# Build the summarization pipeline; truncation=True clips over-long inputs to the model's input limit
summerize_pipe = pipeline("summarization", model=model, tokenizer=tokenizer)
pipe_out = summerize_pipe(sample_dataset, truncation=True)
pipe_out

As a result, one of the summaries I get is shown below. The last sentence is incomplete; it gets cut off for every paper. How can I avoid this?

[{'summary_text': "background : in iran a national free food program ( nffp ) is implemented in elementary schools of deprived areas to cover all poor students . however , this program is not conducted in slums and poor areas of the big cities so many malnourished children with low socio - economic situation are not covered by nffp . therefore , the present study determines the effects of nutrition intervention in an advocacy process model on the prevalence of underweight in school aged children in the poor area of shiraz , iran.materials and methods : this interventional study has been carried out between 2009 and 2010 in shiraz , iran . in those schools all students ( 2897 , 7 - 13 years old ) were screened based on their body mass index ( bmi ) by nutritionists . according to convenience method all students divided to two groups based on their economic situation ; family revenue and head of household 's job and nutrition situation ; the first group were poor and malnourished students and the other group were well nourished or well - off students . for this report , the children 's height and weight were entered into center for disease control and prevention ( cdc ) to calculate bmi and bmi - for -"}


Solution

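  • The summary is cut off because generation stops once the output reaches max_length tokens, and the default comes from the checkpoint's generation settings. You should increase max_length to a larger value, such as 512 or 1024. A value like 2048 may exceed the model's position limit (model.config.max_position_embeddings), so check that before going higher:

    summerize_pipe = pipeline("summarization", model=model, tokenizer=tokenizer, max_length=1024)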
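
Even with a larger max_length, generation can still stop mid-sentence if the model hits the limit before it emits an end-of-sequence token. As a fallback, you could post-process the output and drop the trailing fragment. The following is only a minimal sketch of that idea, not part of transformers: the helper name trim_to_last_sentence is hypothetical, and it assumes a sentence ends with ".", "!" or "?" followed by whitespace or the end of the text, which will misfire on some abbreviations.

```python
import re

# Assumption: a sentence ends with . ! or ? followed by whitespace or end of text.
# This skips periods inside tokens such as "0.5" or "iran.materials".
_SENTENCE_END = re.compile(r"[.!?](?=\s|$)")


def trim_to_last_sentence(text: str) -> str:
    """Drop a trailing sentence fragment (hypothetical helper, not a library call).

    Returns the text unchanged when no sentence boundary is found.
    """
    matches = list(_SENTENCE_END.finditer(text))
    if not matches:
        return text
    # Keep everything up to and including the last sentence terminator.
    return text[: matches[-1].end()].rstrip()


# Tail of the summary shown in the question.
summary = (
    "the other group were well nourished or well - off students . "
    "for this report , the children 's height and weight were entered into "
    "center for disease control and prevention ( cdc ) to calculate bmi and bmi - for -"
)
print(trim_to_last_sentence(summary))
# -> the other group were well nourished or well - off students .
```

With the pipeline output from the question, this would be applied to each entry, for example [trim_to_last_sentence(o["summary_text"]) for o in pipe_out]. Note that it discards text rather than completing it, so raising max_length remains the first thing to try.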