Tags: google-cloud-platform, nlp, google-natural-language

How can I categorize tweets with Google Cloud Natural Language API - if possible?


I am trying to use the Google Cloud Natural Language API to classify/categorize tweets in order to filter out tweets that are not relevant to my audience (weather-related). I understand that it must be tricky for an AI solution to classify a short piece of text, but I would expect it to at least have a guess for text like this:

Wind chills of zero to -5 degrees are expected in Northwestern Arkansas into North-Central Arkansas extending into portions of northern Oklahoma during the 6-9am window . #arwx #okwx

I have tested several tweets, but only very few get a categorization; the rest get no result (or "No categories found. Try a longer text input." if I try it through the GUI).

Is it pointless to hope for this to work? Or is it possible to lower the threshold for the categorization? An "educated guess" from the NLP solution would be better than no filter at all. Is there an alternative solution (other than training my own NLP model)?

Edit: In order to clarify:

In the end, I am using the Google Cloud Platform Natural Language API to classify tweets. To test it, I am using the GUI (linked above). I can see that only a few of the tweets I test (in the GUI) get a categorization from GCP NLP; for the rest, the category is empty.

The desired state is for GCP NLP to provide a category guess for a tweet text rather than an empty result. I assume the NLP model removes any results with a confidence of less than X%. It would be interesting to know whether that threshold can be configured.

I assume tweets have been categorized before. Is there any other way to solve this?

Edit 2: ClassifyTweet-code:

// Load the client library once at module level rather than on every call.
const language = require('@google-cloud/language');

async function classifyTweet(tweetText) {
   // projectId and keyFilename are assumed to be defined elsewhere in the module.
   const client = new language.LanguageServiceClient({projectId, keyFilename});
   //const tweetText = "Some light snow dusted the ground this morning, adding to the intense snow fall of yesterday. Here at my Warwick station the numbers are in, New Snow 19.5cm and total depth 26.6cm. A very good snow event. Photos to be posted. #ONStorm #CANWarnON4464 #CoCoRaHSON525"
   const document = {
      content: tweetText,
      type: 'PLAIN_TEXT',
   };
   // Returns an empty categories array when the API finds no category.
   const [classification] = await client.classifyText({document});

   console.log('Categories:');
   classification.categories.forEach(category => {
     console.log(`Name: ${category.name}, Confidence: ${category.confidence}`);
   });

   return classification.categories;
}

Solution

  • I have dug into the current state of Cloud Natural Language, and my answer to your main question is that this is not possible with classify text in its current state. A workaround, though, is to base your categories on the output you get from analyzing the text of your inputs.

    Assuming we are not using a custom model and only use the options that Cloud Natural Language offers, one tentative approach is as follows:

    To start, I have adapted the code from the official samples to our needs to explain this a bit further:

    # Written against google-cloud-language releases before 2.0, which still ship the enums module.
    from google.cloud import language_v1
    from google.cloud.language_v1 import enums
    
    
    def sample_cloud_natural_language_text(text_content):
        """ 
        Args:
          text_content The text content to analyze. Must include at least 20 words.
        """
    
        client = language_v1.LanguageServiceClient()
        type_ = enums.Document.Type.PLAIN_TEXT
    
        language = "en"
        document = {"content": text_content, "type": type_, "language": language}
    
    
        print("=====CLASSIFY TEXT=====")
        response = client.classify_text(document)
        for category in response.categories:
            print(u"Category name: {}".format(category.name))
            print(u"Confidence: {}".format(category.confidence))
    
    
        print("=====ANALYZE TEXT=====")
        response = client.analyze_entities(document)
        for entity in response.entities:
            print(f">>>>> ENTITY {entity.name}")  
            print(u"Entity type: {}".format(enums.Entity.Type(entity.type).name))
            print(u"Salience score: {}".format(entity.salience))
    
            for metadata_name, metadata_value in entity.metadata.items():
                print(u"{}: {}".format(metadata_name, metadata_value))
    
            for mention in entity.mentions:
                print(u"Mention text: {}".format(mention.text.content))
                print(u"Mention type: {}".format(enums.EntityMention.Type(mention.type).name))
    
    
    if __name__ == "__main__":
        #text_content = "That actor on TV makes movies in Hollywood and also stars in a variety of popular new TV shows."
        text_content="Wind chills of zero to -5 degrees are expected in Northwestern Arkansas into North-Central Arkansas extending into portions of northern Oklahoma during the 6-9am window"
        
        sample_cloud_natural_language_text(text_content)
    

    Output:

    =====CLASSIFY TEXT=====
    =====ANALYZE TEXT=====
    >>>>> ENTITY Wind chills
    Entity type: OTHER
    Salience score: 0.46825599670410156
    Mention text: Wind chills
    Mention type: COMMON
    >>>>> ENTITY degrees
    Entity type: OTHER
    Salience score: 0.16041776537895203
    Mention text: degrees
    Mention type: COMMON
    >>>>> ENTITY Northwestern Arkansas
    Entity type: ORGANIZATION
    Salience score: 0.07702474296092987
    mid: /m/02vvkn4
    wikipedia_url: https://en.wikipedia.org/wiki/Northwest_Arkansas
    Mention text: Northwestern Arkansas
    Mention type: PROPER
    >>>>> ENTITY North
    Entity type: LOCATION
    Salience score: 0.07702474296092987
    Mention text: North
    Mention type: PROPER
    >>>>> ENTITY Arkansas
    Entity type: LOCATION
    Salience score: 0.07088913768529892
    mid: /m/0vbk
    wikipedia_url: https://en.wikipedia.org/wiki/Arkansas
    Mention text: Arkansas
    Mention type: PROPER
    >>>>> ENTITY window
    Entity type: OTHER
    Salience score: 0.06348973512649536
    Mention text: window
    Mention type: COMMON
    >>>>> ENTITY Oklahoma
    Entity type: LOCATION
    Salience score: 0.04747137427330017
    wikipedia_url: https://en.wikipedia.org/wiki/Oklahoma
    mid: /m/05mph
    Mention text: Oklahoma
    Mention type: PROPER
    >>>>> ENTITY portions
    Entity type: OTHER
    Salience score: 0.03542650490999222
    Mention text: portions
    Mention type: COMMON
    >>>>> ENTITY 6
    Entity type: NUMBER
    Salience score: 0.0
    value: 6
    Mention text: 6
    Mention type: TYPE_UNKNOWN
    >>>>> ENTITY 9
    Entity type: NUMBER
    Salience score: 0.0
    value: 9
    Mention text: 9
    Mention type: TYPE_UNKNOWN
    >>>>> ENTITY -5
    Entity type: NUMBER
    Salience score: 0.0
    value: -5
    Mention text: -5
    Mention type: TYPE_UNKNOWN
    >>>>> ENTITY zero
    Entity type: NUMBER
    Salience score: 0.0
    value: 0
    Mention text: zero
    Mention type: TYPE_UNKNOWN
    

    As you can see, classify text does not help much (the result is empty). It is when we analyze the entities in the text that we get some values, and we can use those to build our own categories. The trick (and the hard work) is to build the pool of keywords that fits each category (a category defined by us) and to match the analyzed data against it. As for the categories themselves, we can check the current list of available categories made by Google to get an idea of what categories should look like.
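
    A minimal sketch of that keyword-pool idea, assuming the entities have already been extracted as (name, salience) pairs from the analyze_entities response. The category names, the keywords, and the helper guess_categories are hypothetical and only illustrate the matching step; they are not part of the Natural Language API.

```python
from collections import defaultdict

# Hypothetical keyword pools: category names and keywords are illustrative
# assumptions that you would curate yourself.
CATEGORY_KEYWORDS = {
    "weather": {"wind chills", "snow", "degrees", "storm", "rain", "temperature"},
    "sports": {"game", "team", "score"},
}


def guess_categories(entities, keyword_pools=CATEGORY_KEYWORDS, min_score=0.0):
    """Score each custom category by summing the salience of matching entities.

    entities: iterable of (name, salience) pairs, e.g. built from
              response.entities as (entity.name, entity.salience).
    Returns a list of (category, score) pairs sorted by descending score,
    keeping only categories whose score is greater than min_score.
    """
    scores = defaultdict(float)
    for name, salience in entities:
        normalized = name.strip().lower()
        for category, keywords in keyword_pools.items():
            if normalized in keywords:
                scores[category] += salience
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(category, score) for category, score in ranked if score > min_score]


if __name__ == "__main__":
    # Entity names and salience scores taken from the output above (rounded).
    tweet_entities = [
        ("Wind chills", 0.468),
        ("degrees", 0.160),
        ("Northwestern Arkansas", 0.077),
        ("window", 0.063),
    ]
    print(guess_categories(tweet_entities))
```

    Because the score is computed on our side, min_score acts as a threshold we control: raising it makes the guess stricter, and leaving it at 0.0 returns any category with at least one salient keyword match.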

    I don't think a feature to lower the threshold is implemented in current builds, but it is something that can be requested from Google as a feature.
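
    Until such a feature exists, the "educated guess" can be handled on the client side. The sketch below is one possible arrangement, not an API feature: categorize, classify, and fallback are hypothetical names, with classify standing in for a wrapper around classify_text and fallback for any guess of your own. The 20-word minimum comes from the sample docstring above; counting whitespace-separated words only approximates how the API counts tokens.

```python
MIN_WORDS = 20  # from the sample docstring; word count approximates the API's token count


def categorize(text, classify, fallback, min_words=MIN_WORDS):
    """Return (categories, source), preferring the API over our own guess.

    classify: callable(text) -> list of (name, confidence) pairs; a
              hypothetical wrapper around client.classify_text.
    fallback: callable(text) -> list of (name, score) pairs; a hypothetical
              wrapper around your own guess, e.g. keyword matching.
    """
    if len(text.split()) >= min_words:
        categories = classify(text)
        if categories:
            return categories, "api"
    # The text is too short for classify_text, or it returned no categories.
    return fallback(text), "fallback"
```

    Returning the source alongside the categories lets the caller treat fallback guesses with more caution than categories returned by the API.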