python · spacy · text-processing

Return sentences from a list of sentences using a user-specified keyword


I have a list of sentences (roughly 20,000) stored in an Excel file named list.xlsx, in a sheet named Sentence, under a column also named Sentence.

My intention is to get words from the user and return the sentences in which those exact words appear.

I am currently able to do so with the code I developed using spaCy, but it takes a lot of time to check and return the output.

Is there a faster way of achieving this by any other means?

I see that in the Geany editor or LibreOffice Calc, the search function returns matching lines in a jiffy.

How?

Kindly help.

import pandas as pd
import spacy

# Load the English language model in spaCy
nlp = spacy.load("en_core_web_sm")

# Function to extract sentences containing the keyword (case-insensitive)
def extract_sentences_with_keyword(text, keyword):
    doc = nlp(text)
    keyword = keyword.lower()  # lowercase the keyword so it can match the lowercased sentence text
    return [sent.text for sent in doc.sents if keyword in sent.text.lower()]

keyword = input("Enter Keyword(s): ")

# Read the Excel file
file_path = "list.xlsx"
sheet_name = "Sentence"  # Update with your sheet name
column_name = "Sentence"   # Update with the column containing text data

data = pd.read_excel(file_path, sheet_name=sheet_name)

# Iterate over the rows and extract sentences containing the keyword
for index, row in data.iterrows():
    text = row[column_name]
    sentences = extract_sentences_with_keyword(text, keyword)
    
    if sentences:
        for sentence in sentences:
            print(sentence)
        print("\n")

Solution

  • You can use SQLite with a full-text index. I tried the following proof-of-concept code with a 6 MB text file, and it is very fast. You will of course need to adjust the code for your needs; using spaCy for sentence splitting, as you did above, might be a decent option:

    import sqlite3
    import re
    
    # In-memory database; pass a file path instead to persist the index
    conn = sqlite3.connect(':memory:')
    cursor = conn.cursor()
    
    # An FTS5 virtual table gives us a full-text index over the sentences
    cursor.execute('CREATE VIRTUAL TABLE fts_sentences USING fts5(content)')
    
    def load_and_split_file(file_path):
        # Naive splitting on ., ! or ? followed by whitespace
        sentence_endings = r'[.!?]\s+|\s*$'
        with open(file_path, 'r', encoding='utf-8') as file:
            text = file.read()
            sentences = re.split(sentence_endings, text)
            return sentences
    
    def insert_sentences(sentences):
        for sentence in sentences:
            cursor.execute('INSERT INTO fts_sentences (content) VALUES (?)', (sentence,))
        conn.commit()
    
    def search_word(word):
        # MATCH queries the full-text index, so lookups are fast
        cursor.execute('SELECT content FROM fts_sentences WHERE fts_sentences MATCH ?', (word,))
        return cursor.fetchall()
    
    file_path = 'big.txt'
    sentences = load_and_split_file(file_path)
    insert_sentences(sentences)
    
    while True:
        word_to_search = input('Enter a word to search for: ')
        matching_sentences = search_word(word_to_search)
    
        for sentence in matching_sentences:
            print(sentence[0])
    
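    Note that FTS5 interprets the MATCH argument as a query expression, so an exact multi-word phrase should be wrapped in double quotes, e.g. search_word('"net income"').

    If the naive regex splitting above is not good enough, spaCy's rule-based sentencizer splits sentences without loading a full trained model, which keeps it fast. A minimal sketch (the function name load_and_split_file_spacy is hypothetical, meant as a drop-in replacement for load_and_split_file above):

    import spacy
    
    # A blank pipeline with only the rule-based sentencizer runs no
    # statistical components, so it is far faster than a trained model
    nlp = spacy.blank("en")
    nlp.add_pipe("sentencizer")
    # Raise the character limit for large files; safe here because no
    # memory-hungry components (parser, NER) are in the pipeline
    nlp.max_length = 10_000_000
    
    def load_and_split_file_spacy(file_path):
        with open(file_path, 'r', encoding='utf-8') as file:
            text = file.read()
        return [sent.text.strip() for sent in nlp(text).sents]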

    Your code using spaCy is also very slow because you do not disable any pipeline components, so it also performs steps like part-of-speech tagging, which you do not need for your use case. For details, see: https://spacy.io/usage/processing-pipelines

    Quoting from the docs (you might need to disable more or fewer components):

    import spacy
    
    texts = [
        "Net income was $9.4 million compared to the prior year of $2.7 million.",
        "Revenue exceeded twelve billion dollars, with a loss of $1b.",
    ]
    
    nlp = spacy.load("en_core_web_sm")
    for doc in nlp.pipe(texts, disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"]):
        # Do something with the doc here
        print([(ent.text, ent.label_) for ent in doc.ents])
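
    Applied to your original script, that could look roughly like the sketch below (untested against your data). It keeps tok2vec and the parser enabled, because doc.sents needs sentence boundaries and the parser may depend on tok2vec in en_core_web_sm; the main wins are streaming the whole column through nlp.pipe in one call and skipping the tagger, lemmatizer and NER:

    import pandas as pd
    import spacy
    
    nlp = spacy.load("en_core_web_sm")
    
    data = pd.read_excel("list.xlsx", sheet_name="Sentence")
    texts = data["Sentence"].astype(str).tolist()
    
    keyword = input("Enter Keyword(s): ").lower()
    
    # Stream all rows through the pipeline at once, with only the
    # components needed for sentence segmentation left enabled
    for doc in nlp.pipe(texts, disable=["tagger", "attribute_ruler", "lemmatizer", "ner"]):
        for sent in doc.sents:
            if keyword in sent.text.lower():
                print(sent.text)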