python pandas nlp spacy

Error running my spacy summarization function on a text column in pandas dataframe


Below is a spaCy function for summarization. I am trying to apply it to a pandas DataFrame column, but I get an empty column every time. Can someone help me figure out why?

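import spacy
from heapq import nlargest
from string import punctuation
from spacy.lang.en.stop_words import STOP_WORDS

def summarize(text, per):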
    nlp = spacy.load('en_core_web_sm')
    doc= nlp(text)
    tokens=[token.text for token in doc]
    word_frequencies={}
    for word in doc:
        if word.text.lower() not in list(STOP_WORDS):
            if word.text.lower() not in punctuation:
                if word.text not in word_frequencies.keys():
                    word_frequencies[word.text] = 1
                else:
                    word_frequencies[word.text] += 1
    max_frequency=max(word_frequencies.values())
    for word in word_frequencies.keys():
        word_frequencies[word]=word_frequencies[word]/max_frequency
    sentence_tokens= [sent for sent in doc.sents]
    sentence_scores = {}
    for sent in sentence_tokens:
        for word in sent:
            if word.text.lower() in word_frequencies.keys():
                if sent not in sentence_scores.keys():                            
                    sentence_scores[sent]=word_frequencies[word.text.lower()]
                else:
                    sentence_scores[sent]+=word_frequencies[word.text.lower()]
    select_length=int(len(sentence_tokens)*per)
    summary=nlargest(select_length, sentence_scores,key=sentence_scores.get)
    final_summary=[word.text for word in summary]
    summary=''.join(final_summary)
    return summary

It also spits out an empty result for a sample text:

text = 'gov charlie crist launched amounts nuclear attack republican politics fox news sunday showdown marco rubio crist labeled rubio house speaker tax raiser forth record tax issues crist singled rubios failed 2007 plan eliminated property taxes floridians exchange increase state sales tax tax swap massive tax increase crist said march 28 2010 senate debate respect speaker youve got tell truth people thats rubio contends tax swap huge net tax cut plan supported gov jeb bush tax cut tax hike lets look months speaker early 2007 rubio proposed fundamental change floridas tax structure proposal scratch property taxes primary residences place state sales tax increased 25 cents dollar subject voter approval house analysis originally said swap save taxpayers total 58 billion year certainly contrary crists claim saved money spent money end year likely depended individual circumstances 2007 st petersburg times ran calculations rubios proposal homeowners renters homeowners family annual income 64280 home value 241100 current property tax tampa 506106 sales taxes paid 951 proposed property tax tampa 0 sales taxes paid 1290 rubios plan homeowners paid 4722 state taxes times contrast renters renters family annual income 46914 current rent 851 sales taxes paid 691 proposed rent 851 sales taxes paid 937 rubios plan renters pay additional 246 year taxes rental property owners pay property taxes meaning rent wouldnt affected talked swap swap owned home wouldnt pay tax anymore crist said debate percent fellow floridians renters applied enjoyed tax increase rubio responded renters opportunity buy exorbitant taxes pay property florida gone conversely rubio pointed increased sales tax bring revenue state nonresident visitors tourists contribute said floridians contribute rubios proposal got seal approval grover norquist president americans tax reform rubio supporter 2007 wrote legislators saying rubios tax swap proposal amounted net tax cut speaker rubios proposal net tax cut vote 
proposal constitute violation taxpayer protection pledge norquist wrote taxpayers florida reap benefits lower tax burden significant spending restraint state local level later house study said sales tax increase generate 93 billion exchange eliminating 158 billion property taxes heres house analysis swap combined tax initiatives tallahassee bunch politicians declare 7 billion net tax savings tax increase rep adam hasner rdelray beach told palm beach post vote proposal saying tax increase swap ultimately killed state senate crist spokeswoman andrea saul noted rubio said tax swap tax increase march 28 2010 debate according transcripts rubio said let tell supposed program raise taxes keeps talking probably largest tax increase floridas history eliminated property taxes sorts people supported jeb bush rubio spokesman alberto martinez said rubio mispoke shocking try distort martinez said based statements surround rubios largest tax increase line reasonable meant decrease crist said rubios tax swap proposal massive tax increase basic level rubios proposal tax increase tax decrease state sales tax property taxes micro level people pay pay macro level different studies said floridians paid 58 billion 65 billion generally leery tax impact projections suggestion rubios plan resulted tax increase statewide certainly massive crist suggests'
summarize(text, 0.05)

I don't know whether the function is wrong or the problem is something else, but when I run it on the DataFrame column I get an empty column again:

df['spacy_summary'] = df['final'].apply(lambda x: summarize(x, 0.05))

So I guess it's the function? Any help is appreciated. Thank you!


Solution

  • Your summarization logic assumes the text contains valid sentences that spaCy can recognize, but your example text doesn't provide that. It has no punctuation or capitalization, so spaCy will likely treat it all as one long sentence rather than splitting it into several. Sentence segmentation needs well-formed input with punctuation marks and so on. Try it with a text made up of multiple sentences that spaCy can recognize.

    On top of that, you use int(len(sentence_tokens)*per). int() truncates toward zero, so for a positive number it simply drops the fractional part. With a single sentence, int(1*0.05) = int(0.05) = 0, so nlargest is asked for 0 sentences and the summary is empty. With per = 0.05 this happens for every text with fewer than 20 segmented sentences. Either raise the ratio or use something like max(1, int(len(sentence_tokens)*per)).

    Other than that, I think the code should generally work, though I didn't check every detail. I'm not sure whether you know exactly what it does: it summarizes by keeping only the per share of the highest-scoring full sentences. It doesn't change anything at the word level.
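
To make the two points above concrete, here is a minimal sketch of the same frequency-based approach with the max(1, ...) guard applied. It is not your function: to keep it self-contained it swaps spaCy for a naive regex sentence splitter and a tiny hand-written stop-word list, and the names select_count, split_sentences and SMALL_STOP_WORDS are made up for illustration. It also counts lower-cased words, because your original stores frequencies under word.text but looks them up with word.text.lower(), so capitalized words never contribute to a sentence score.

```python
import re
from collections import Counter
from heapq import nlargest

# Assumption: a tiny illustrative stop-word list; the question uses spaCy's STOP_WORDS.
SMALL_STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in", "it", "on"}

WORD_RE = re.compile(r"[a-z0-9']+")


def select_count(n_sentences, per):
    """How many sentences to keep: int(n * per), but never fewer than 1."""
    return max(1, int(n_sentences * per))


def split_sentences(text):
    """Naive splitter on ., ! and ? (a stand-in for spaCy's doc.sents)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def summarize(text, per):
    sentences = split_sentences(text)
    words = [w for w in WORD_RE.findall(text.lower()) if w not in SMALL_STOP_WORDS]
    if not sentences or not words:
        return ""

    # Normalized word frequencies, keyed by the lower-cased word.
    freq = Counter(words)
    max_freq = max(freq.values())

    # Score each sentence (by index) as the sum of its words' frequencies.
    # Counter returns 0 for stop words and other unseen keys.
    scores = {
        i: sum(freq[w] / max_freq for w in WORD_RE.findall(sent.lower()))
        for i, sent in enumerate(sentences)
    }

    top = nlargest(select_count(len(sentences), per), scores, key=scores.get)
    # Join with a space and keep the original sentence order.
    return " ".join(sentences[i] for i in sorted(top))


# The truncation problem in isolation:
print(int(1 * 0.05))          # 0, so nlargest(0, ...) returns nothing
print(select_count(1, 0.05))  # 1, so at least one sentence is kept
```

With the guard in place, a text that is segmented as a single sentence comes back as that one sentence instead of an empty string. For your unpunctuated sample that means the whole text is returned, so you would still want to feed in properly punctuated text to get a useful summary. Note also that the sketch joins sentences with ' ' where your version uses ''.join, which glues the selected sentences together without any separator.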