Why does gensim summarize() sometimes return a blank string?


I'm a beginner at NLP and I'm using gensim for the first time. I noticed that for some texts it returns a blank summary. For example:

from gensim.summarization import summarize
text ="The continued digitization of most every sector of society and industry means that an ever-growing volume of data will continue to be generated. The ability to gain insights from these vast datasets is one key to addressing an enormous array of issues — from identifying and treating diseases more effectively, to fighting cyber criminals, to helping organizations operate more effectively to boost the bottom line."
summarize(text, 0.6)

returns: ''

When I have similarly sized paragraphs in other instances it returns a summary, so I know it's not that my ratio is too small. Any insights appreciated!


Solution

  • For the sake of the answer I'll assume Gensim version 3.8.3, the last release that includes the summarization module; it was removed in Gensim 4.0.

    Specifically, when looking at the reference for summarize(), we can read the following:

    Get a summarized version of the given text.
    The output summary will consist of the most representative sentences and will be returned as a string, divided by newlines.

    The phrase "most representative sentences" also explains why your output is empty: Gensim employs an extractive summarizer, which can only select whole sentences, not parts of sentences. Your text contains only two sentences, so either entire sentences are selected (resulting in no real "summarization") or the empty string is returned. Fixing this problem is not trivial, and I think you have two (sub-optimal) choices:

    • Employ an abstractive summarizer. Compared to extractive summarization, abstractive models can actually do what humans usually "expect" from a system, namely re-wording and selection of phrases from a sentence to form a shorter output, without relying on the selection of sentences. However, such models are usually quite compute-intensive, and there is no such model available through Gensim (AFAIK).
    • Pre-chunk your text. If you can achieve a reasonable segmentation of your input into several chunks of text, these can stand in for "multiple sentences" and would allow you to get an approximate summary, even though it probably won't be very good.
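
As a rough illustration of the pre-chunking idea, here is a minimal sketch that uses only the standard library. The helper names `chunk_text` and `chunks_to_text`, the `min_words` threshold, and the choice of split characters are all assumptions made for this example; they are not part of Gensim. The sketch splits at sentence ends, commas, semicolons, and dashes, merges very short fragments into the preceding chunk, and then writes each chunk as its own period-terminated line so that a sentence-based summarizer has more units to choose from.

```python
import re


def chunk_text(text, min_words=4):
    """Split text into clause-like chunks (hypothetical helper, not a Gensim API)."""
    # Split on whitespace that follows ., ! or ?, and at commas, semicolons,
    # and em-dashes (\u2014), swallowing the surrounding whitespace.
    pieces = re.split(r"(?<=[.!?])\s+|\s*[,;\u2014]\s*", text)
    chunks = []
    for piece in pieces:
        piece = piece.strip().rstrip(".!?")
        if not piece:
            continue
        if chunks and len(piece.split()) < min_words:
            # Fragment is too short to stand alone: attach it to the previous chunk.
            chunks[-1] += ", " + piece
        else:
            chunks.append(piece)
    return chunks


def chunks_to_text(chunks):
    """Write each chunk as its own period-terminated line."""
    return "\n".join(chunk + "." for chunk in chunks)


if __name__ == "__main__":
    sample = (
        "from identifying and treating diseases more effectively, "
        "to fighting cyber criminals, to helping organizations operate "
        "more effectively to boost the bottom line."
    )
    print(chunks_to_text(chunk_text(sample)))
```

The resulting string could then be passed to summarize() in place of the original text. With only a handful of short chunks the result may still be empty or not very informative, so treat this as a starting point rather than a fix.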