Tags: python, dataframe, machine-learning, nlp, data-science

How can I tokenize a paragraph into sentences, put each sentence into a pandas data frame, and perform sentiment analysis on each?


Beginner NLP/Python programmer here. The title says it all: I need code that will tokenize a paragraph into sentences, perform sentiment analysis on each sentence, and put each sentence along with its rating into a pandas data frame. I already have code that can tokenize a paragraph and even perform sentiment analysis, but I'm struggling to combine both in a data frame. So far, I have:

I used newspaper3k to extract the article text from the URL.

from newspaper import fulltext
import requests
url = "https://www.click2houston.com/news/local/2021/06/18/houston-water-wastewater-proposed-increase-this-is-what-mayor-sylvester-turner-wants-you-to-know/"
text = fulltext(requests.get(url).text)

Then I used the BERT extractive summarizer to summarize the article text.

from summarizer import Summarizer  # from the bert-extractive-summarizer package

models = Summarizer()
result = models(text, min_length=30)
full = "".join(result)
type(full)

Then I tokenized the summary into sentences using nltk.

from nltk.tokenize import sent_tokenize
import numpy as np
tokens = sent_tokenize(full)
print(type(np.array(tokens)[0]))

Lastly, I put it into a basic dataframe.

import pandas as pd
df = pd.DataFrame(np.array(tokens), columns=['sentences'])

The only thing I'm missing is the sentiment analysis. I need a sentiment rating (preferably from BERT) for each sentence, added as a column in the data frame.


Solution

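  • Hugging Face's transformers library lets you do this with its pipeline API. The example below summarizes and scores the first ten lines of the article: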

    from transformers import pipeline
    from newspaper import fulltext
    import requests
    import pandas as pd

    url = "https://www.click2houston.com/news/local/2021/06/18/houston-water-wastewater-proposed-increase-this-is-what-mayor-sylvester-turner-wants-you-to-know/"
    text = fulltext(requests.get(url).text)
    # Take the first 10 lines of the article, then drop the blank ones
    texts = [item.strip() for item in text.split('\n')[:10] if item.strip()]

    # Both pipelines download a default pretrained model on first use
    summarizer = pipeline("summarization")
    sentiment_analyser = pipeline('sentiment-analysis')

    # Each pipeline returns a list with one dict per input; unpack the field we need
    summarize = lambda text: summarizer(text, min_length=5, max_length=30)[0]['summary_text']
    sentiment_analyse = lambda text: sentiment_analyser(text)[0]['label']

    df = pd.DataFrame(texts, columns=['lines'])
    df['Summarized'] = df.lines.apply(summarize)
    df['Sentiment'] = df.lines.apply(sentiment_analyse)
    print(df.head())
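
The answer above works on lines of the article, while the question asks for one row per sentence with its rating. A minimal sketch of that step follows, assuming the classifier behaves like a Hugging Face sentiment pipeline (a list of strings in, one dict with "label" and "score" per string out). The names `score_sentences` and `fake_classify` are hypothetical helpers for illustration, not part of any library, and the stand-in classifier only exists so the sketch runs without downloading a model.

```python
from typing import Callable, Dict, List


def score_sentences(
    sentences: List[str],
    classify: Callable[[List[str]], List[Dict]],
) -> List[Dict]:
    """Pair each non-empty sentence with the label and score from `classify`.

    Assumption: `classify` takes a list of strings and returns one
    {"label": ..., "score": ...} dict per string, in the same order.
    """
    sentences = [s.strip() for s in sentences if s.strip()]
    if not sentences:
        return []
    results = classify(sentences)  # one batched call instead of one call per row
    return [
        {"sentences": s, "label": r["label"], "score": r["score"]}
        for s, r in zip(sentences, results)
    ]


# Hypothetical stand-in classifier (keyword rule, fixed score); not a real model.
def fake_classify(batch: List[str]) -> List[Dict]:
    return [
        {"label": "NEGATIVE" if "increase" in s else "POSITIVE", "score": 0.9}
        for s in batch
    ]


rows = score_sentences(
    ["Rates will increase.", "   ", "Service improves."], fake_classify
)
for row in rows:
    print(row)
```

With the real pieces, passing the `tokens` list from `sent_tokenize` and `pipeline('sentiment-analysis')` as `classify`, then wrapping the result in `pd.DataFrame(...)`, should give a frame with `sentences`, `label`, and `score` columns. Keeping the score next to the label makes it possible to filter out low-confidence ratings later.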