python, machine-learning, text-classification

Check if a string forms a word


I am building a Python text classification application. In the app, the user provides a short sentence (or a single word) and we classify it. The problem I'm facing is finding a way to check whether the string forms a word or a group of words.

Examples of invalid user input:

1) "asdfasdfa"

2) "This is adsfgafdga"

Example 1 is not a word, so I want to raise an error. Example 2 contains a non-word string, so I want to raise an error there too.

Examples of valid input:

1) "Hello"

2) "This is good"

Is there a way to do that without a list of words, or does someone know an API that does this?


Solution

  • One straightforward method is to store the dictionary words in a list. First split the user input into individual words using phrase.split().

    words = phrase.split()
    # words: ['This', 'is', 'good']

    len(words)
    # number of words: 3
    

    Loop over the words in the phrase (a single-word input is simply a loop of one). Then it's merely a matter of checking whether each word is present in the list, using the following.

    # word is one element of words
    if word in dictionary_words:
        print("Word is available")
    

    There's a neat XML version of the dictionary words you can use instead of the list.

    For a more sophisticated solution, you can try a library like PyEnchant, which provides spell checking. To try it, run pip install pyenchant and import it.

    >>> import enchant
    >>> help(enchant)
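
Putting the two ideas together, here is a minimal sketch rather than a definitive implementation. The helper names (find_non_words, validate_phrase, in_word_list, make_enchant_checker) are made up for illustration, and dictionary_words is a tiny placeholder you would replace with a real word list. The PyEnchant option assumes the package and an en_US dictionary are installed; it only uses enchant.Dict(...).check(word).

```python
import string

try:
    import enchant  # third-party and optional: pip install pyenchant
except ImportError:
    enchant = None


def find_non_words(phrase, is_word):
    """Return the tokens of `phrase` that the `is_word` callable rejects."""
    rejected = []
    for token in phrase.split():
        # Drop surrounding punctuation so "good." is checked as "good".
        cleaned = token.strip(string.punctuation)
        if not cleaned or not is_word(cleaned):
            rejected.append(token)
    return rejected


def validate_phrase(phrase, is_word):
    """Raise ValueError if the phrase is empty or contains a non-word."""
    if not phrase.split():
        raise ValueError("empty input")
    rejected = find_non_words(phrase, is_word)
    if rejected:
        raise ValueError(f"not recognised as words: {rejected}")


# Option 1: a word list held in a set (placeholder contents, assumption).
dictionary_words = {"hello", "this", "is", "good"}


def in_word_list(word):
    return word.lower() in dictionary_words


# Option 2: let PyEnchant decide, if it is available.
def make_enchant_checker(language="en_US"):
    if enchant is None:
        raise RuntimeError("pyenchant is not installed")
    return enchant.Dict(language).check


print(find_non_words("This is good", in_word_list))        # []
print(find_non_words("This is adsfgafdga", in_word_list))  # ['adsfgafdga']
```

Because the check is passed in as a function, you can start with the set and later swap in validate_phrase(text, make_enchant_checker()) without touching the rest of the code. A set is used instead of a list only because membership tests on it are faster.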