java android machine-learning nlp stanford-nlp

Query classification for Virtual Assistant in Java?

This is my first time posting to Stack Overflow, so please let me know if I should be more thorough when asking questions in the future.

I am currently working on a Virtual Assistant application for Android using Java, and although it is going well so far, I am unsure how to approach classifying user input. So far I have integrated the Stanford NLP Parser into the program, so that clause, phrase, and word labels can be applied to the raw text. This has allowed the program to recognise direct questions and extract the subject from them, simply by searching for the occurrence of certain tags.

(ROOT
  (SBARQ <--- Indicates that the sentence is a question
    (WHNP (WP Who))
    (SQ (VBD were)
      (NP (DT the) (FW samurai))) <--- Subject of question
    (. ?)))

Although this feels like a step forward, I hope to eventually have the assistant classify different types of questions (weather-related questions, time/date-related questions, etc.) while also recognising questions that are less direct but ask for the same information (e.g. "can you tell me about the samurai?" as opposed to "who were the samurai?"). Doing this with just the Stanford NLP Parser and a search for certain tags seems like a very difficult task. Does anyone have any advice on alternative approaches I could take?

Thank you!

Solution

In the context of virtual assistants or chatbots, this is usually called intent classification. There are many ways to do this, but generally you provide labelled examples and train a model to differentiate them. Here's some example data from a blog post on the topic:

# 3 classes of training data
training_data = []
training_data.append({"class":"greeting", "sentence":"how are you?"})
training_data.append({"class":"greeting", "sentence":"how is your day?"})
training_data.append({"class":"greeting", "sentence":"good day"})
training_data.append({"class":"greeting", "sentence":"how is it going today?"})

training_data.append({"class":"goodbye", "sentence":"have a nice day"})
training_data.append({"class":"goodbye", "sentence":"see you later"})
training_data.append({"class":"goodbye", "sentence":"have a nice day"})
training_data.append({"class":"goodbye", "sentence":"talk to you soon"})

training_data.append({"class":"sandwich", "sentence":"make me a sandwich"})
training_data.append({"class":"sandwich", "sentence":"can you make a sandwich?"})
training_data.append({"class":"sandwich", "sentence":"having a sandwich today?"})
training_data.append({"class":"sandwich", "sentence":"what's for lunch?"})

While your training data is specific to your application, the task is in principle no different from automatically categorizing emails or news articles.

An easy-to-use baseline algorithm for text classification is Naive Bayes. More recent methods include using Word Mover's Distance or neural networks.
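As a rough illustration of the Naive Bayes baseline, here is a minimal sketch in plain Python that trains on the example data above. The function names (`tokenize`, `train`, `classify`), the simple regex tokenizer, and the choice to skip words never seen in training are all assumptions made for this example, not part of the blog post or of any library. In a real project you would more likely use an existing implementation.

```python
import math
import re
from collections import Counter, defaultdict

# Same toy data as above, repeated here so the sketch is self-contained.
training_data = [
    {"class": "greeting", "sentence": "how are you?"},
    {"class": "greeting", "sentence": "how is your day?"},
    {"class": "greeting", "sentence": "good day"},
    {"class": "greeting", "sentence": "how is it going today?"},
    {"class": "goodbye", "sentence": "have a nice day"},
    {"class": "goodbye", "sentence": "see you later"},
    {"class": "goodbye", "sentence": "have a nice day"},
    {"class": "goodbye", "sentence": "talk to you soon"},
    {"class": "sandwich", "sentence": "make me a sandwich"},
    {"class": "sandwich", "sentence": "can you make a sandwich?"},
    {"class": "sandwich", "sentence": "having a sandwich today?"},
    {"class": "sandwich", "sentence": "what's for lunch?"},
]


def tokenize(sentence):
    """Lowercase the text and split it into word tokens (hypothetical helper)."""
    return re.findall(r"[a-z']+", sentence.lower())


def train(examples):
    """Count word occurrences per class and the number of examples per class."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for example in examples:
        class_counts[example["class"]] += 1
        word_counts[example["class"]].update(tokenize(example["sentence"]))
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, class_counts, vocabulary


def classify(sentence, model):
    """Return the class with the highest log posterior under multinomial Naive Bayes."""
    word_counts, class_counts, vocabulary = model
    total_examples = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = math.log(count / total_examples)  # log prior
        total_words = sum(word_counts[label].values())
        for word in tokenize(sentence):
            if word not in vocabulary:
                continue  # assumption: ignore words never seen in training
            # Laplace (add-one) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
        scores[label] = score
    return max(scores, key=scores.get)


model = train(training_data)
print(classify("could you make me a sandwich", model))
```

With only a dozen sentences the model mostly keys on individual words, so it is best treated as a baseline to compare other approaches against once you have collected more examples per intent.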

The part where you extract the subject is also called slot detection (or slot filling), and "intent and slot" architectures for assistants are common. Even if you want to build something from scratch, looking at the configuration screens of chatbot platforms like Rasa may help you get an idea of how to use training data.
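To make the "intent and slot" idea concrete, here is a small sketch of what the output of such a system might look like for the two phrasings in the question. Everything in it is hypothetical: the `ParsedQuery` class, the `lookup_topic` intent name, and the regular expressions are illustrative stand-ins. In a real assistant the intent would come from a trained classifier and the slots from a tagger or from your parse tree, not from hand-written patterns.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ParsedQuery:
    """Hypothetical container: one intent label plus named slot values."""
    intent: str
    slots: dict = field(default_factory=dict)


# Hypothetical hand-written patterns standing in for a trained slot tagger.
TOPIC_PATTERNS = [
    re.compile(r"^(?:who|what) (?:is|are|was|were) (?P<topic>.+?)\??$"),
    re.compile(r"^(?:can|could) you tell me about (?P<topic>.+?)\??$"),
]


def parse_query(text):
    """Map differently phrased questions onto the same intent and slot."""
    cleaned = text.strip().lower()
    for pattern in TOPIC_PATTERNS:
        match = pattern.match(cleaned)
        if match:
            return ParsedQuery("lookup_topic", {"topic": match.group("topic")})
    return ParsedQuery("unknown")


direct = parse_query("Who were the samurai?")
indirect = parse_query("Can you tell me about the samurai?")
print(direct)
print(indirect)
```

The point of the structure is that both phrasings end up as the same intent with the same slot value, so the rest of the application only has to handle one case per intent.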