Tags: python, dataframe, lambda, nlp, spacy

Split data frame of comments into multiple rows


I have a data frame with long comments, and I want to split them into individual sentences using spaCy's sentencizer.

import pandas as pd

Comments = pd.read_excel('Comments.xlsx', sheet_name='Sheet1')
Comments
>>>
         reviews
    0    One of the rare films where every discussion leaving the theater is about how much you 
         just had, instead of an analysis of its quotients.
    1    Gorgeous cinematography, insane flying action sequences, thrilling, emotionally moving, 
         and a sequel that absolutely surpasses its predecessor. Well-paced, executed & has that 
         re-watchability factor.

I loaded the model like this:

import spacy
nlp = spacy.load("en_core_web_sm")

Then I applied the sentencizer:

from spacy.lang.en import English
nlp = English()
nlp.add_pipe('sentencizer')
Data = Comments.reviews.apply(lambda x: list(nlp(x).sents))

But when I check the result, all the sentences of a comment are still in a single row, like this:

[One of the rare films where every discussion leaving the theater is about how much you just had.,
 Instead of an analysis of its quotients.]

Thanks a lot for any help. I'm new to using NLP tools with data frames.


Solution

  • Currently, Data is a Series in which each row holds a list of sentences or, more precisely, a list of spaCy Span objects. You probably want the text of these sentences, with each sentence on its own row.

    import pandas as pd

    # Two sample reviews standing in for the Excel data
    comments = [{'reviews': 'This is the first sentence of the first review. And this is the second.'},
                {'reviews': 'This is the first sentence of the second review. And this is the second.'}]
    
    comments = pd.DataFrame(comments)  # building your input DataFrame
    
    +----+--------------------------------------------------------------------------+
    |    | reviews                                                                  |
    |----+--------------------------------------------------------------------------|
    |  0 | This is the first sentence of the first review. And this is the second.  |
    |  1 | This is the first sentence of the second review. And this is the second. |
    +----+--------------------------------------------------------------------------+
    

    Now let's define a function that, given a string, returns its sentences as a list of strings. It relies on the nlp pipeline with the sentencizer from the question.

    def obtain_sentences(s):
        doc = nlp(s)
        sents = [sent.text for sent in doc.sents]
        return sents
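
As a quick check of the difference between Span objects and their text, here is a minimal sketch that assumes spaCy v3 and the blank English pipeline from the question. The names spans and texts are only illustrative.

```python
from spacy.lang.en import English

# Blank English pipeline with the rule-based sentencizer (no model download needed)
nlp = English()
nlp.add_pipe("sentencizer")


def obtain_sentences(s):
    """Return the sentences of the string s as plain strings."""
    doc = nlp(s)
    return [sent.text for sent in doc.sents]


sample = "This is the first sentence. And this is the second."

# doc.sents yields Span objects, which is what the question's Data contains
spans = list(nlp(sample).sents)
print(type(spans[0]).__name__)

# Taking .text on each Span gives ordinary Python strings
texts = obtain_sentences(sample)
print(texts)
```

The first print shows the Span type, and the second shows a list of two strings, one per sentence.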
    

    The function can be applied to the comments DataFrame to produce a new DataFrame containing sentences.

    data = comments.copy()
    data['reviews'] = comments['reviews'].apply(obtain_sentences)
    data = data.explode('reviews').reset_index(drop=True)
    data
    

    explode turns each element of the sentence lists into its own row, and reset_index(drop=True) then renumbers the rows from 0.

    This is the resulting output:

    +----+--------------------------------------------------+
    |    | reviews                                          |
    |----+--------------------------------------------------|
    |  0 | This is the first sentence of the first review.  |
    |  1 | And this is the second.                          |
    |  2 | This is the first sentence of the second review. |
    |  3 | And this is the second.                          |
    +----+--------------------------------------------------+
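
If you also need to know which review each sentence came from, one possible variation is to keep the original index instead of dropping it. The sketch below assumes spaCy v3 and the same sample reviews. The column names review_id and sentence are made up for illustration, and nlp.pipe is used here only as a convenient way to process all texts in one pass.

```python
import pandas as pd
from spacy.lang.en import English

nlp = English()
nlp.add_pipe("sentencizer")

comments = pd.DataFrame({"reviews": [
    "This is the first sentence of the first review. And this is the second.",
    "This is the first sentence of the second review. And this is the second.",
]})

# nlp.pipe processes the texts as a stream and yields one Doc per review
docs = nlp.pipe(comments["reviews"])

# One list of sentence strings per review, aligned with the original index
sentences = pd.Series(
    [[sent.text for sent in doc.sents] for doc in docs],
    index=comments.index,
    name="sentence",
)

# explode repeats the original index label for every sentence, so naming
# the index and resetting it keeps the source review as a regular column
data = sentences.explode().rename_axis("review_id").reset_index()
print(data)
```

Each row of data then holds one sentence together with the index of the review it belongs to.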