Tags: python, pandas, text, python-re, findall

How to extract the sentences before and after a sentence that contains a keyword or substring?


I would like to create a function that extracts a sentence containing a keyword or substring of interest, as well as the sentences before and after it. If possible, I would like to specify how many sentences to extract. The function should not raise an error if the key sentence is the first sentence.

In the example below, I created a function that extracts a single sentence. How can I expand to include more sentences?


data = [[0, 'Johannes Gensfleisch zur Laden zum Gutenberg was a German inventor, printer, publisher, and goldsmith who introduced printing to Europe with his mechanical movable-type printing press. His work started the Printing Revolution in Europe and is regarded as a milestone of the second millennium, ushering in the modern period of human history. It played a key role in the development of the Renaissance, Reformation, Age of Enlightenment, and Scientific Revolution, as well as laying the material basis for the modern knowledge-based economy and the spread of learning to the masses.'], 
[1, 'While not the first to use movable type in the world,[a] Gutenberg was the first European to do so. His many contributions to printing include the invention of a process for mass-producing movable type; the use of oil-based ink for printing books;[7] adjustable molds;[8] mechanical movable type; and the use of a wooden printing press similar to the agricultural screw presses of the period.[9] His truly epochal invention was the combination of these elements into a practical system that allowed the mass production of printed books and was economically viable for printers and readers alike. Gutenbergs method for making type is traditionally considered to have included a type metal alloy and a hand mould for casting type. The alloy was a mixture of lead, tin, and antimony that melted at a relatively low temperature for faster and more economical casting, cast well, and created a durable type.'], 
[2, 'The use of movable type was a marked improvement on the handwritten manuscript, which was the existing method of book production in Europe, and upon woodblock printing, and revolutionized European book-making. Gutenbergs printing technology spread rapidly throughout Europe and later the world. His major work, the Gutenberg Bible (also known as the 42-line Bible), was the first printed version of the Bible and has been acclaimed for its high aesthetic and technical quality. In Renaissance Europe, the arrival of mechanical movable type printing introduced the era of mass communication which permanently altered the structure of society. The relatively unrestricted circulation of information—including revolutionary ideas—transcended borders, captured the masses in the Reformation, and threatened the power of political and religious authorities; the sharp increase in literacy broke the monopoly of the literate elite on education and learning and bolstered the emerging middle class. Across Europe, the increasing cultural self-awareness of its people led to the rise of proto-nationalism, accelerated by the flowering of the European vernacular languages to the detriment of Latins status as lingua franca. In the 19th century, the replacement of the hand-operated Gutenberg-style press by steam-powered rotary presses allowed printing on an industrial scale, while Western-style printing was adopted all over the world, becoming practically the sole medium for modern bulk printing. ']]

# Imports needed by the code in this example
import re
import pandas as pd

# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['text_number', 'text'])

def extract_key_sentences(text,word_list):
    joined_word_list = '|'.join(word_list)
    print(joined_word_list)
    sentence = re.findall(r'([^.]*?'+joined_word_list+'[^.]*\.)', text)
    return sentence

tools_list=['printing press','paper','ink','woodblock','molds','method']
df['key_sentence']=df['text'].apply(lambda x : extract_key_sentences(str(x),tools_list))
df['key_sentence'].head()

With the current approach, the full sentence is not always extracted: in the third row (index 2), the extracted text starts at 'method of book production' instead of at the beginning of the sentence.
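A likely cause of the truncated match is operator precedence: `|` binds more loosely than anything else in a regular expression, so without parentheses the alternation splits the whole pattern. Only the first keyword gets the `[^.]*?` prefix, and only the last one gets the `[^.]*\.` suffix. The sketch below wraps the keywords in a non-capturing group so both parts apply to every keyword. The function name `extract_key_sentences_grouped` and the sample text are made up for illustration, and the pattern still assumes that every period ends a sentence.

```python
import re

def extract_key_sentences_grouped(text, word_list):
    # Escape each keyword so it is matched literally, then group the
    # alternation so the sentence prefix and suffix apply to every keyword.
    joined_word_list = '|'.join(map(re.escape, word_list))
    pattern = r'[^.]*?(?:' + joined_word_list + r')[^.]*\.'
    # Strip the whitespace left over from the previous sentence boundary.
    return [s.strip() for s in re.findall(pattern, text)]

# Hypothetical sample text, not taken from the DataFrame above
sample = ('The press was new. It replaced the existing method of book '
          'production. Nobody objected.')
print(extract_key_sentences_grouped(sample, ['woodblock', 'method']))
# ['It replaced the existing method of book production.']
```

This only repairs the single-sentence extraction; it does not yet return the neighboring sentences.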


Solution

  • If I understand correctly, you can split each string into sentences and explode so that there is one sentence per row, flag the sentences that match the keywords, and use a groupby.rolling.max to propagate the flag to the neighboring sentences.

    Then aggregate back as a single string (optional):

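    import re

    word_list = ['printing press', 'paper', 'ink', 'woodblock', 'molds', 'method']
    # Escape the keywords so they are matched literally
    joined_word_list = '|'.join(map(re.escape, word_list))
    
    # Number of sentences to keep on each side of a matching sentence
    N = 3
    
    df[f'{N}_around'] = (df['text']
     # One sentence per row; the original row label is repeated in the index
     .str.split(r'(?<=\.)\s*').explode()
     .loc[lambda d: 
          # Flag matching sentences, then spread the flag N sentences each way
          d.str.contains(joined_word_list)
           .groupby(level=0).rolling(2*N+1, center=True, min_periods=1).max()
           .droplevel(1).astype(bool)
         ]
     # Join the kept sentences back into one string per original row
     .groupby(level=0).agg(' '.join)
    )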
    

    In this particular case, for the third row, only the first sentence matches a keyword. The match propagates to the following 3 sentences, which are kept, and the remaining 4 are dropped.

    output:

       text_number                                               text  \
    0            0  Johannes Gensfleisch zur Laden zum Gutenberg w...   
    1            1  While not the first to use movable type in the...   
    2            2  The use of movable type was a marked improveme...   
    
                                                3_around  
    0  Johannes Gensfleisch zur Laden zum Gutenberg w...  
    1  While not the first to use movable type in the...  
    2  The use of movable type was a marked improveme...
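
For a plain function interface, as the question asks for, the same idea can be sketched without the rolling window: split the text into sentences, find the indices of the matching ones, and keep a clamped slice around each. This is an alternative sketch, not the pandas code above. The names `extract_with_context`, `n_before` and `n_after` are hypothetical, and splitting on a period followed by whitespace is an assumption that will mis-split abbreviations and footnote markers.

```python
import re

def extract_with_context(text, word_list, n_before=1, n_after=1):
    """Return matching sentences plus their neighbours, in document order."""
    # Assumption: a sentence ends at a period followed by whitespace.
    sentences = [s for s in re.split(r'(?<=\.)\s+', text.strip()) if s]
    keyword_re = re.compile('|'.join(map(re.escape, word_list)))

    keep = set()
    for i, sentence in enumerate(sentences):
        if keyword_re.search(sentence):
            # max/min clamp the window, so a match in the first or last
            # sentence does not raise an error.
            start = max(0, i - n_before)
            stop = min(len(sentences), i + n_after + 1)
            keep.update(range(start, stop))

    return ' '.join(sentences[i] for i in sorted(keep))

# Hypothetical sample strings for illustration
print(extract_with_context('One. Two has ink. Three. Four. Five.', ['ink']))
# One. Two has ink. Three.
print(extract_with_context('Ink first. Second. Third.', ['Ink'], n_before=2, n_after=1))
# Ink first. Second.
```

With the DataFrame from the question, it could be applied per row with something like `df['text'].apply(lambda t: extract_with_context(str(t), word_list, 1, 1))`. Overlapping windows are merged because the kept indices are collected in a set.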