Tags: python, sentiment-analysis, pos-tagger, information-theory

How to detect aboutness with a Python POS tagger


I am working in Python to take a Facebook status, determine what the status is about, and determine its sentiment. Essentially I need to tell what the sentiment refers to. I have already successfully coded a sentiment analyzer, so the trouble is getting a POS tagger to work out what the sentiment is referring to.

If you have any suggestions from experience I would be grateful. I've read some papers on computing aboutness from subject-object, NP-PP, and NP-NP relations, but I haven't seen any good examples and haven't found many papers.

Lastly, if you have worked with POS taggers, what would be my best bet in Python as a non-computer scientist? I'm a physicist, so I can hack code together, but I don't want to reinvent the wheel if there is already a package that has everything I'm going to need.
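
To make the question concrete, here is a minimal sketch of the kind of off-the-shelf tagging and noun-phrase chunking I am hoping a package already provides, using NLTK (this assumes the standard NLTK tokenizer and tagger data have been fetched with nltk.download(); the chunk grammar is just an illustrative guess):

    # Minimal sketch with NLTK's stock tools; requires the usual nltk.download() data.
    import nltk

    status = "Verizon has not honored this appointment."
    tokens = nltk.word_tokenize(status)          # split the status into word tokens
    tagged = nltk.pos_tag(tokens)                # Penn Treebank tags, e.g. ('Verizon', 'NNP')

    # Illustrative chunk grammar: optional determiner, any adjectives, then nouns.
    grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
    chunker = nltk.RegexpParser(grammar)
    tree = chunker.parse(tagged)                 # Tree whose NP subtrees are candidate topics

    for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP'):
        print(" ".join(word for word, tag in subtree.leaves()))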

Thank you very much in advance!


Solution

  • This is what I found to work; I'm going to edit it and use it with the NLTK POS tagger and see what results I can get.

    import csv
    import nltk
    from nltk.corpus import brown
    
    # http://thetokenizer.com/2013/05/09/efficient-way-to-extract-the-main-topics-of-a-sentence/
    
    
    # This is our fast Part of Speech tagger
    #############################################################################
    brown_train = brown.tagged_sents(categories='news')
    regexp_tagger = nltk.RegexpTagger(
        [(r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),
         (r'(-|:|;)$', ':'),
         (r'\'*$', 'MD'),
         (r'(The|the|A|a|An|an)$', 'AT'),
         (r'.*able$', 'JJ'),
         (r'^[A-Z].*$', 'NNP'),
         (r'.*ness$', 'NN'),
         (r'.*ly$', 'RB'),
         (r'.*s$', 'NNS'),
         (r'.*ing$', 'VBG'),
         (r'.*ed$', 'VBD'),
         (r'.*', 'NN')
    ])
    unigram_tagger = nltk.UnigramTagger(brown_train, backoff=regexp_tagger)
    bigram_tagger = nltk.BigramTagger(brown_train, backoff=unigram_tagger)
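    # Back-off chain: the bigram tagger uses Brown bigram statistics first, falls
    # back to the unigram tagger for unseen contexts, and the unigram tagger falls
    # back to the regexp rules above for words not seen in the training sentences.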
    #############################################################################
    
    
    # This is our semi-CFG; Extend it according to your own needs
    #############################################################################
    cfg = {}
    cfg["NNP+NNP"] = "NNP"
    cfg["NN+NN"] = "NNI"
    cfg["NNI+NN"] = "NNI"
    cfg["JJ+JJ"] = "JJ"
    cfg["JJ+NN"] = "NNI"
    #############################################################################
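    # These pair rules drive the merge loop in extract() below: when two adjacent
    # tags match a key (e.g. "JJ+NN"), the pair of tokens is joined into a single
    # token carrying the value tag, so multi-word noun phrases collapse into one term.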
    
    
    class NPExtractor(object):
    
        def __init__(self, sentence):
            self.sentence = sentence
    
        # Split the sentence into single words/tokens
        def tokenize_sentence(self, sentence):
            tokens = nltk.word_tokenize(sentence)
            return tokens
    
        # Normalize brown corpus' tags ("NN", "NN-PL", "NNS" > "NN")
        def normalize_tags(self, tagged):
            n_tagged = []
            for t in tagged:
                if t[1] == "NP-TL" or t[1] == "NP":
                    n_tagged.append((t[0], "NNP"))
                    continue
                if t[1].endswith("-TL"):
                    n_tagged.append((t[0], t[1][:-3]))
                    continue
                if t[1].endswith("S"):
                    n_tagged.append((t[0], t[1][:-1]))
                    continue
                n_tagged.append((t[0], t[1]))
            return n_tagged
    
        # Extract the main topics from the sentence
        def extract(self):
    
            tokens = self.tokenize_sentence(self.sentence)
            tags = self.normalize_tags(bigram_tagger.tag(tokens))
    
            merge = True
            while merge:
                merge = False
                for x in range(0, len(tags) - 1):
                    t1 = tags[x]
                    t2 = tags[x + 1]
                    key = "%s+%s" % (t1[1], t2[1])
                    value = cfg.get(key, '')
                    if value:
                        merge = True
                        tags.pop(x)
                        tags.pop(x)
                        match = "%s %s" % (t1[0], t2[0])
                        pos = value
                        tags.insert(x, (match, pos))
                        break
    
            matches = []
            for t in tags:
                if t[1] == "NNP" or t[1] == "NNI":
                #if t[1] == "NNP" or t[1] == "NNI" or t[1] == "NN":
                    matches.append(t[0])
            return matches
    
    
    # Main method, just run "python np_extractor.py"
    # Example status text (kept for reference; main() below reads statuses from a CSV instead)
    Summary="""
    
    
    Verizon has not honored this appointment or notified me of the delay in an appropriate manner. It is now 1:20 PM and the only way I found out of a change is that I called their chat line and got a message saying my appointment is for 2 PM. My cell phone message says the original time as stated here.
    
    
    """
    def main(Topic):
        # Assumes fb_data1.csv holds one Facebook status per row, in the first column
        facebookData = []
        with open('fb_data1.csv', 'r') as f:
            for row in csv.reader(f):
                facebookData.append(row[0])
        relevant_sentence = []
        for status in facebookData:
            summary = status.split('.')
            for sentence in summary:
                np_extractor = NPExtractor(sentence)
                result = np_extractor.extract()
                if Topic in result:
                    relevant_sentence.append(sentence)
                print(sentence)
                print("This sentence is about: %s" % ", ".join(result))
        return relevant_sentence
    
    if __name__ == '__main__':
        result=main('Verizon')
    

Note that it will save only the sentences that are relevant to the topic you define. So if I am analyzing statuses about cheese, I could use "cheese" as the topic, extract all of the sentences about cheese, and then run a sentiment analysis on those. If you have comments or suggestions on improving this, please let me know!
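
As a rough sketch of that workflow, the topic filter can be wired to your own sentiment code along these lines; analyze_sentiment below is a placeholder for the analyzer mentioned in the question and is not defined anywhere in this post:

    # Hypothetical glue: analyze_sentiment stands in for your existing sentiment
    # analyzer; only main() above is defined in this post.
    def sentiment_for_topic(topic):
        relevant = main(topic)                   # sentences that mention the topic
        return [(s, analyze_sentiment(s)) for s in relevant]

    # e.g. sentiment_for_topic('cheese') scores every cheese-related sentence.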