python, spacy, huggingface-transformers

ML Algo for determining if a sentence is a question


I have a form that users are supposed to use to ask support questions. Instead, we've noticed that a lot of people are using it to submit feedback or statements to update parts of their record.

I wanted to run through all records and separate out questions from feedback or statements (assume that feedback and statements are NOT questions).

Does anybody know of a good pre-trained model or process I can use to resolve this type of issue?

I originally thought I could use spaCy to look for keywords (how, where, when, why) or a "?", but I realized that some incoming questions only imply a question rather than being formatted as one (e.g. "could I please have a copy of file A."). I went to Hugging Face and looked at some text-analysis models but couldn't find one that properly handled examples like the one above.


Solution

  • Hard-rules approach

    You can use NLTK's POS tagging and create hard rules that identify patterns associated with questions.

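    import nltk
    from nltk import pos_tag
    from nltk.tokenize import word_tokenize

    # One-time download of the tokenizer and tagger data.
    # Recent NLTK releases may ask for 'punkt_tab' and
    # 'averaged_perceptron_tagger_eng' instead; use the names in the error message.
    nltk.download('punkt')
    nltk.download('averaged_perceptron_tagger')

    text = "could I please have a copy of file A."

    # Tokenize, then tag each token with its part of speech
    tokens = word_tokenize(text)
    pos_tags = pos_tag(tokens)  # list of (token, tag) pairs

    # Keep only the tags
    tags_only = [tag for _, tag in pos_tags]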
    

    tags_only:

    ['MD', 'PRP', 'VB', 'VB', 'DT', 'NN', 'IN', 'NN', 'NNP', '.']
    

    The 'MD', 'PRP', 'VB' sequence (modal, personal pronoun, base-form verb) can usually be associated with questions.

    This, together with regex on the presence of "how", "where", "when", "why", "wondering" or "?", might do the trick.
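    A minimal sketch of how the two rules could be combined. The function name looks_like_question is hypothetical, and checking the tag pattern only at the start of the sentence is an assumption, not something the rule above requires. The tags are passed in (for example from pos_tag) so the check itself needs only the standard library.

```python
import re

# Keywords taken from the rule above; "?" is checked separately.
QUESTION_WORDS = re.compile(r"\b(how|where|when|why|wondering)\b", re.IGNORECASE)


def looks_like_question(text, tags):
    """Heuristic check; `tags` is the list of POS tags for `text`."""
    # Rule 1: an explicit question mark or a question keyword.
    if "?" in text or QUESTION_WORDS.search(text):
        return True
    # Rule 2 (assumption: only at sentence start): modal + pronoun + base verb,
    # as in "could I please ...".
    return tags[:3] == ["MD", "PRP", "VB"]


text = "could I please have a copy of file A."
tags_only = ['MD', 'PRP', 'VB', 'VB', 'DT', 'NN', 'IN', 'NN', 'NNP', '.']
print(looks_like_question(text, tags_only))  # True
```

    The keyword rule is deliberately loose: a statement such as "I updated my address when I moved." would also match, so it is worth reviewing a sample of the flagged records before trusting the split.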

  • Clustering approach

    Alternatively, you could run a simple clustering on your corpus to identify the different types of text. With this approach you could run two distinct clusterings:

    • a basic one on the text itself
    • one on the POS tags, to add further information to your corpus
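    The two clusterings might be sketched as follows, assuming scikit-learn with TF-IDF features and k-means (the answer does not name a specific algorithm, so both are assumptions). The sample records are hypothetical. Only the first tag sequence comes from the example above; the others are hand-written approximations so the sketch runs without NLTK data. The helper name cluster_documents is also hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer


def cluster_documents(docs, vectorizer, n_clusters=2, seed=0):
    """Vectorize docs with TF-IDF and return one k-means label per doc."""
    features = vectorizer.fit_transform(docs)
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return model.fit_predict(features).tolist()


# Hypothetical sample records; replace with the real form submissions.
texts = [
    "could I please have a copy of file A.",
    "how do I reset my password?",
    "please update my address to 12 Main Street.",
    "the new portal is much faster, thanks.",
]

# POS-tag sequences for the same records, e.g. produced with nltk.pos_tag.
# Hand-written approximations here (assumption), one string per record.
tag_sequences = [
    "MD PRP VB VB DT NN IN NN NNP .",
    "WRB VBP PRP VB PRP$ NN .",
    "VB VB PRP$ NN TO CD NNP NNP .",
    "DT JJ NN VBZ RB JJR , NNS .",
]

# Clustering 1: on the words themselves.
text_labels = cluster_documents(texts, TfidfVectorizer())

# Clustering 2: on POS-tag n-grams; keep case and punctuation tags intact.
tag_vectorizer = TfidfVectorizer(
    lowercase=False, token_pattern=r"\S+", ngram_range=(1, 3)
)
tag_labels = cluster_documents(tag_sequences, tag_vectorizer)

for text, t_label, p_label in zip(texts, text_labels, tag_labels):
    print(t_label, p_label, text)
```

    Cluster numbers are arbitrary, so you would still need to inspect a few records per cluster to decide which one, if any, corresponds to questions. The number of clusters (2 here) is also a guess and may need tuning on real data.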