Tags: python, pandas, nlp, spacy, textacy

More efficient implementation of Textacy / spacy 'subject_verb_object_triples'


I'm trying to apply the 'extract.subject_verb_object_triples' function from textacy to my dataset. However, the code I have written is very slow and memory-intensive. Is there a more efficient implementation?

import spacy
import textacy

def extract_SVO(text):

    nlp = spacy.load('en_core_web_sm')
    doc = nlp(text)
    tuples = textacy.extract.subject_verb_object_triples(doc)
    tuples_to_list = list(tuples)
    if tuples_to_list != []:
        tuples_list.append(tuples_to_list)

tuples_list = []          
sp500news['title'].apply(extract_SVO)
print(tuples_list)

Sample data (sp500news)

        date_publish         title
0       2013-05-14 17:17:05  Italy will not dismantle Montis labour reform  minister
1       2014-05-09 20:15:57  Exclusive US agency FinCEN rejected veterans in bid to hire lawyers
4       2018-07-19 10:29:54  Xis campaign to draw people back to graying rural China faces uphill battle
6       2012-04-17 21:02:54  Romney begins to win over conservatives
8       2012-12-12 20:17:56  Oregon mall shooting survivor in serious condition
9       2018-11-08 10:51:49  Polands PGNiG to sign another deal for LNG supplies from US CEO
11      2013-08-25 07:13:31  Australias opposition leader pledges stronger economy if elected PM
12      2015-01-09 00:54:17  New York shifts into Code Blue to get homeless off frigid streets

Solution

  • This should speed it up somewhat:

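    import spacy
    import textacy

    # Load the model once, at module level, instead of on every call
    nlp = spacy.load('en_core_web_sm')

    def extract_SVO(doc):
        # subject_verb_object_triples returns a generator, which is always
        # truthy, so build the list first and then test it for emptiness
        tuples_to_list = list(textacy.extract.subject_verb_object_triples(doc))
        if tuples_to_list:
            tuples_list.append(tuples_to_list)

    tuples_list = []
    # Parse every title once; this replaces the strings with spaCy Doc objects
    sp500news['title'] = sp500news['title'].apply(nlp)
    _ = sp500news['title'].apply(extract_SVO)
    print(tuples_list)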
    

    Explanation

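    In the OP's implementation, nlp = spacy.load('en_core_web_sm') is called inside the function, so the model is reloaded for every row. This is most likely the biggest bottleneck. Loading the model once, outside the function, should speed things up.

    Also, subject_verb_object_triples returns a generator, and a generator object is always truthy, so the emptiness check has to be made on the list built from it rather than on the generator itself.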
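
A further step that may help is to let spaCy process the titles in batches with nlp.pipe, which is usually faster than calling nlp on one string at a time. The following is only a minimal sketch: the helper name extract_svo_batch and its parameters are made up for illustration, and the model and the extractor are passed in as arguments so the function does not depend on a global list.

```python
from typing import Any, Callable, Iterable, List


def extract_svo_batch(
    texts: Iterable[str],
    nlp: Any,
    extract_triples: Callable[[Any], Iterable[Any]],
    batch_size: int = 256,
) -> List[list]:
    """Return one list of SVO triples per text that yields at least one triple.

    `nlp` is assumed to be a loaded spaCy pipeline (anything with a `.pipe`
    method), and `extract_triples` is assumed to be a callable such as
    textacy.extract.subject_verb_object_triples.
    """
    results = []
    # nlp.pipe streams the texts through the pipeline in batches
    for doc in nlp.pipe(texts, batch_size=batch_size):
        # The extractor returns a generator, so materialize it before
        # checking whether anything was found
        triples = list(extract_triples(doc))
        if triples:
            results.append(triples)
    return results
```

With the data from the question, and with nlp loaded once as above, this would be called as extract_svo_batch(sp500news['title'], nlp, textacy.extract.subject_verb_object_triples). Unlike the apply-based version, it leaves the 'title' column as plain strings. If named entities are not needed, loading the model with spacy.load('en_core_web_sm', disable=['ner']) may save some additional time, although I have not measured that on this dataset.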