Tags: python, algorithm, nlp

NLP sentence summarization techniques with Python


I am trying to make a Python script that takes a sentence and summarizes it in 5-7 words, keeping the keywords and details.

So far, I have used the nltk library to first remove any symbols and numbers, then remove all word types except nouns and verbs. I also included a function to remove all stopwords (words without value, like 'the' and 'it').

The code I have is extremely basic and the output isn't grammatically correct or understandable. My main objective is to take a sentence like:

"Drug stability refers to the ability of a pharmaceutical product to retain its quality, safety, and efficacy over time"

...and turn it into:

"Drug stability is ability to retain quality, safety, and efficacy"

But when I run the code I get "Drug stability refers ability pharmaceutical product retain quality, safety, efficacy time", which isn't bad, but I want the system to produce output that is more grammatically correct while still retaining the major keywords. I am aware of libraries like gensim or nltk summarize, but these only pick out the important sentences of a paragraph based on word frequency; they don't simplify single sentences. Are there any other methods for sentence summarization?

Here is the code I have so far:

import re

from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Assumes the NLTK tokenizer, POS tagger, and stopwords data are already downloaded.
# Penn Treebank tags to keep: common nouns and all verb forms
KEEP_TAGS = {'NN', 'NNS', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'}


def shortenSentence(sentence):
    #sentence = "%^Regulatory scientists must take measures to guarantee that the drug remains consistent and safe from the moment of production through packaging, storage, and shipping.907"
    # Keep only letters and whitespace; digits, punctuation, and symbols are removed
    clean_sentence = re.sub(r'[^a-zA-Z\s]', '', sentence)

    def remove_adj_adv(sentence):
        # Keep only nouns and verbs, dropping adjectives, adverbs, and other word types
        words = word_tokenize(sentence)
        pos_tags = pos_tag(words)
        shortened = [word for word, tag in pos_tags if tag in KEEP_TAGS]
        return ' '.join(shortened)

    shortened = remove_adj_adv(clean_sentence)

    words = word_tokenize(shortened)
    # Get the list of English stopwords
    stop_words = set(stopwords.words('english'))
    # Remove stopwords from the list of words
    filtered_words = [word for word in words if word.lower() not in stop_words]
    # Join the filtered words back into a sentence
    return ' '.join(filtered_words)

Solution

  • What if you had a larger sentence?

    While your approach works, summarizing longer sentences may pose additional issues.

    I would recommend adding two extra functions to your algorithm.

    First, identify the important words using a weighting scheme such as TF-IDF.

    Second, set a limit on the summary length, for example a minimum of 10 and a maximum of 20 words.

    Since you are pre-processing the input text to remove unwanted words (stopwords, adjectives, etc.), reconstructing the sentence so that it is syntactically and grammatically correct is a different problem, which may require generative LLMs.

    Thus, you should shift your focus from summarization to text generation based on keywords.
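
The two additions suggested above (TF-IDF scoring and a word limit) could be sketched roughly as follows. This is a minimal pure-Python illustration, not a definitive implementation: the names `tokenize`, `tfidf_scores`, and `top_keywords` are hypothetical, the smoothed IDF formula is one common variant chosen here as an assumption, and the tiny reference corpus is made up for the example. TF-IDF needs some collection of other documents to compare against, so the result depends heavily on which corpus you supply. Only the upper word limit is enforced, since a short input may not contain enough words to reach a minimum.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_scores(sentence, corpus):
    """Score each distinct word of `sentence` by TF-IDF against `corpus`.

    Assumption: a smoothed IDF, log((1 + N) / (1 + df)) + 1, so that words
    absent from the corpus do not cause a division by zero.
    """
    tokens = tokenize(sentence)
    docs = [set(tokenize(doc)) for doc in corpus]
    n_docs = len(docs)
    scores = {}
    for word, count in Counter(tokens).items():
        df = sum(1 for doc in docs if word in doc)  # document frequency
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        scores[word] = (count / len(tokens)) * idf
    return scores


def top_keywords(sentence, corpus, max_words=20):
    """Keep at most `max_words` highest-scoring words, in their original order."""
    scores = tfidf_scores(sentence, corpus)
    ranked = sorted(scores, key=scores.get, reverse=True)
    keep = set(ranked[:max_words])
    seen = set()
    result = []
    for word in tokenize(sentence):
        if word in keep and word not in seen:
            seen.add(word)
            result.append(word)
    return result


if __name__ == "__main__":
    # Made-up reference corpus, for illustration only
    corpus = [
        "The drug is stored in a cool place.",
        "The product is shipped to the pharmacy.",
        "Quality of the product is checked over time.",
    ]
    # Input is the already-filtered output quoted in the question
    filtered = ("Drug stability refers ability pharmaceutical product "
                "retain quality safety efficacy time")
    print(top_keywords(filtered, corpus, max_words=7))
```

Words that are rare in the reference corpus score highest, so with a small or unrepresentative corpus the ranking may drop words you consider essential. The selected keywords are still not a grammatical sentence; that step is the separate generation problem described above.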