python pandas dataframe nlp spacy

Creating a list of sentences from a file and adding it into a dataframe


I am using the code below to split the contents of a file into sentences. The function returns them as a list.


def extract_sentences(file):
    content = nlp(file)
    sentences = list(content.sents)
    return sentences

After that, I want to add each sentence to a dataframe, under the column "sentence". The problem is that in the dataframe each sentence appears as a list of words separated by commas, e.g. (this, process, includes, different, stages...). I want it to appear as plain text instead: this process includes different stages


Solution

  • content.sents is a generator that yields spacy.tokens.span.Span objects, not strings.

    If you want a list of strings as output, you can use:

    def extract_sentences(file):
        content = nlp(file)
        return [x.text for x in content.sents]
    

    Note that the .text property returns the textual representation of the Span object.
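
To cover the dataframe part of the question, here is a minimal sketch of how the returned strings could be placed under a "sentence" column. It assumes spaCy v3 and pandas are installed. The blank English pipeline with the rule-based sentencizer and the sample text are assumptions made so the example runs without a trained model; the pipeline the question actually loads as nlp is not shown.

```python
import pandas as pd
import spacy

# Assumption: a blank English pipeline with the rule-based sentencizer,
# used here only so the sketch runs without downloading a trained model.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")


def extract_sentences(text):
    """Return the sentences of `text` as plain strings."""
    doc = nlp(text)
    return [sent.text for sent in doc.sents]


# Hypothetical sample text standing in for the contents of the file.
sample = "This process includes different stages. Each stage has its own output."

# One row per sentence, stored as str rather than as Span objects.
df = pd.DataFrame({"sentence": extract_sentences(sample)})
print(df)
```

Because each cell now holds a str instead of a Span, the column should display the sentence text rather than a comma-separated sequence of tokens.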