I am trying to make a Python script that takes a sentence and summarizes it in 5-7 words while keeping the keywords and details.
So far, I have used the nltk library to first remove any symbols and numbers, then remove all word types except nouns and verbs. I also included a function to remove all stopwords (low-value words like 'the' and 'it').
The code I have is extremely basic and the output isn't grammatically correct or understandable. My main objective is to take a sentence like:
"Drug stability refers to the ability of a pharmaceutical product to retain its quality, safety, and efficacy over time"
...and turn it into:
"Drug stability is ability to retain quality, safety, and efficacy"
But when I run the code I get "Drug stability refers ability pharmaceutical product retain quality, safety, efficacy time", which isn't bad, but I want the output to be more grammatically correct while still retaining the major keywords. I am aware of libraries like gensim and nltk's summarizers, but those only extract the important sentences of a paragraph through word frequency; they don't simplify a single sentence. Are there any other methods for sentence summarization?
Here is the code I have so far:
import re
from nltk import pos_tag, word_tokenize
from nltk.corpus import stopwords

def shortenSentence(sentence):
    #sentence = "%^Regulatory scientists must take measures to guarantee that the drug remains consistent and safe from the moment of production through packaging, storage, and shipping.907"
    # Strip everything except letters and whitespace (removes digits and symbols)
    clean_sentence = re.sub(r'[^a-zA-Z\s]', '', sentence)
    #print(clean_sentence)

    def remove_adj_adv(sentence):
        words = word_tokenize(sentence)
        pos_tags = pos_tag(words)
        # Keep only nouns (NN*) and verbs (VB*)
        shortened = [word for word, tag in pos_tags if tag in ['NN', 'NNS', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']]
        return ' '.join(shortened)

    shortened = remove_adj_adv(clean_sentence)
    #print(shortened)
    words = word_tokenize(shortened)
    # Get the list of English stopwords
    stop_words = set(stopwords.words('english'))
    # Remove stopwords from the list of words
    filtered_words = [word for word in words if word.lower() not in stop_words]
    # Join the filtered words back into a sentence
    filtered_sentence = ' '.join(filtered_words)
    #print(filtered_sentence)
    return filtered_sentence
What if you had a larger sentence?
While your approach works, summarizing longer sentences may pose additional issues.
I would recommend adding two extra functions to your algorithm.
First, identify important words using a representation scheme such as TF-IDF.
Second, set a threshold of a minimum of 10 to a maximum of 20 words to be included in the summary.
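A minimal sketch of both steps might look like this. It uses a smoothed TF-IDF computed against a small background corpus of tokenized documents; the corpus, thresholds, and function names here are illustrative, not a fixed API:

```python
import math
from collections import Counter

def tfidf_scores(sentence_words, corpus):
    """Score each word in the sentence by TF-IDF against a background
    corpus (a list of tokenized documents)."""
    n_docs = len(corpus)
    tf = Counter(sentence_words)
    scores = {}
    for word, count in tf.items():
        df = sum(1 for doc in corpus if word in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed IDF
        scores[word] = (count / len(sentence_words)) * idf
    return scores

def top_keywords(sentence_words, corpus, min_k=10, max_k=20):
    """Keep the highest-scoring words, with the summary length capped
    between min_k and max_k words."""
    scores = tfidf_scores(sentence_words, corpus)
    ranked = sorted(scores, key=scores.get, reverse=True)
    k = max(min_k, min(max_k, len(ranked)))
    return ranked[:k]
```

Words that appear in many background documents (like "drug" in a pharmaceutical corpus) score low, while words specific to the input sentence score high, which is what makes them keyword candidates.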
Since you are pre-processing the input text to remove unwanted words (stopwords, adjectives, etc.), reconstructing the sentence to be syntactically or grammatically correct is a different problem, which may require using generative LLMs.
Thus, you should shift your focus from summarization to text generation based on keywords.
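One lightweight way to frame that is to turn your extracted keywords into an instruction prompt for a generative model. The sketch below only builds the prompt; the model call shown in the comments assumes the Hugging Face transformers library and a text2text model, and the exact model choice is up to you:

```python
def keywords_to_prompt(keywords, max_words=12):
    """Build an instruction prompt asking a generative model to compose
    one grammatical sentence from the extracted keywords."""
    return (f"Write one grammatical sentence of at most {max_words} words "
            f"using these keywords: {', '.join(keywords)}")

prompt = keywords_to_prompt(
    ["drug stability", "retain", "quality", "safety", "efficacy"], max_words=10
)
# The prompt can then be sent to any instruction-tuned model, e.g. via the
# transformers pipeline (assumed installed; model name is illustrative):
#   from transformers import pipeline
#   generator = pipeline("text2text-generation", model="google/flan-t5-base")
#   print(generator(prompt)[0]["generated_text"])
```

The keyword extraction then stays deterministic and inspectable, and only the final sentence reconstruction is delegated to the model.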