Check if a string is a word or part of a word in Tkinter Text Widget

I am working on a spellchecker for a tkinter text widget. I've got it working so that the user can select an incorrect word and replace all instances of the incorrect word in the text widget. However, if the word appears within another word, it will also replace it. I don't want this.

For example, say the user typed the sentence:

Hello how ay you today

They've misspelt the word 'are' as 'ay', so they could right-click on it to replace all instances of the word 'ay' with 'are'.

My problem is that the string 'ay' also appears in 'today'. This means that when the user right-clicks on 'ay', it turns 'today' into 'todare', replacing the 'ay' in 'today' with 'are'.

To replace the word I am using the search function. I thought about checking whether the characters on either side of the misspelt word were spaces, but I didn't know how to implement it. Here is my code below (note: this is vastly simplified, and my actual code is thousands of lines long. In the real program, the button is a context menu):

from tkinter import *
from spellchecker import SpellChecker

root = Tk()
notepad = Text(root)
notepad.pack()

spell_dict = SpellChecker()


def check_spelling(event=None):
    # event defaults to None so this works as a Button command (no arguments)
    # as well as an event binding.
    global spell_dict

    misspelt_words_list = []
    paragraph_list = notepad.get('1.0', END).strip('\n').split()

    notepad.tag_config('misspelt_word_tag', foreground='red', underline=1)

    for word in paragraph_list:

        if (word not in spell_dict) and (word not in misspelt_words_list):
            misspelt_words_list.append(word)

        elif (word in misspelt_words_list) and (word in spell_dict):
            misspelt_words_list.remove(word)

    notepad.tag_remove('misspelt_word_tag', '1.0', END)

    for misspelt_word in misspelt_words_list:
        misspelt_word_offset = '+%dc' % len(misspelt_word)

        # Plain substring search: this also matches inside longer words.
        pos_start = notepad.search(misspelt_word, '1.0', END)

        while pos_start:

            pos_end = pos_start + misspelt_word_offset
            notepad.tag_add("misspelt_word_tag", pos_start, pos_end)

            pos_start = notepad.search(misspelt_word, pos_end, END)


button = Button(root, text = "This is a test", command = check_spelling)
button.pack()

root.mainloop()

As I said before, if the user writes ll ll hello, where 'll' is misspelt (let's say the program will correct it to I'll), pressing the button should replace every whole word 'll', but not the 'll' in 'hello'.

THIS: ll ll hello -> I'll I'll hello, NOT: ll ll hello -> I'll I'll heI'llo

Thanks for your help.

(I'm using Windows 10 with Python 3.7)

Solution

The solution to your problem is to use regular expressions. Regular expressions let you search for more than literal text: a pattern can include metacharacters that constrain where a match may occur. For example, an expression can be written to match a string only at the start of a line or the start of a word.

In your case, you want to find whole words. In the context of the text widget search method, a whole word can be searched for by surrounding the string you're searching for with \m (start of word) and \M (end of word).

For example, to search for "ll" only as a whole word, you should search for \mll\M. Because the backslash is an escape character in Python string literals and we need the backslash itself to reach the search method, it needs to be protected. The easiest way is to use a raw string.
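One way to experiment with \m and \M without opening a window is to ask a bare Tcl interpreter to run the pattern through Tcl's own regexp command, which uses the same expression syntax. This is only a sketch: count_whole_word_matches is a hypothetical helper name, and it assumes the word contains no regular expression metacharacters.

```python
import tkinter


def count_whole_word_matches(word, text):
    """Count whole-word occurrences of `word` in `text` using Tcl's regexp."""
    tcl = tkinter.Tcl()  # Tcl interpreter only; no Tk window is created
    pattern = r"\m{}\M".format(word)  # assumes `word` has no metacharacters
    # `regexp -all` with no match variables returns the number of matches.
    return int(tcl.call("regexp", "-all", "--", pattern, text))
```

Calling count_whole_word_matches("ll", "ll ll hello") counts the two standalone words and ignores the "ll" inside "hello".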

So, given a word in a variable (eg: word="ll"), we can make a pattern that looks like this:

pattern = r'\m{}\M'.format(word)
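That pattern assumes the word contains no characters that are special in a regular expression. If it might (a dot, for instance), one option is to escape the word before wrapping it. The following is a minimal sketch: whole_word_pattern is a hypothetical helper name, and using Python's re.escape to build a Tcl pattern is an assumption that should hold for ordinary punctuation, since Tcl also treats a backslash followed by a non-alphanumeric character as that literal character.

```python
import re


def whole_word_pattern(word):
    """Build a text-widget search pattern that matches `word` as a whole word."""
    # re.escape is designed for Python's re module; relying on its output
    # being valid for Tcl's regex engine is an assumption (see lead-in).
    return r"\m" + re.escape(word) + r"\M"
```

For a plain word such as "ll" the result is the same \mll\M pattern as before.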

To use that pattern in a search, we need to set the regexp parameter of the search method to True. There are a couple of other things that need to be done. We want to have the search method tell us how many characters matched the pattern. In the case of searching for "ll" we know it will always be two characters, but a good general solution would be to have the search mechanism tell us. We can do that by passing an IntVar as the count parameter of the search method.

The other thing we need to do is make sure the search stops at the end of the widget, otherwise, it will wrap around to the start and continue searching forever.

Once we have all of that in place, we can search for the string "ll" in the text widget only as whole words with something like this:

countvar = IntVar()
pos = "1.0"
pattern = r'\mll\M'

pos = notepad.search(pattern, pos, "end", count=countvar, regexp=True)
pos_end = notepad.index("{} + {} chars".format(pos, countvar.get()))

With that, pos marks the beginning of the match and pos_end marks the end of the match. If pos is the empty string then we know tkinter didn't find a match (and in that case we can skip computing pos_end).

Putting it all together, we can create a general purpose function that finds and highlights all of the words in a list with something like this:

def highlight_words(widget, tag, word_list):
    """Find all whole words in word_list and apply the given tag"""
    widget.tag_remove(tag, "1.0", END)

    countvar = IntVar()
    for word in word_list:
        pos = "1.0"
        pattern = r"\m{}\M".format(word)
        while widget.compare(pos, "<", "end"):
            pos = widget.search(pattern, pos, "end", count=countvar, regexp=True)
            if pos:
                pos_end = widget.index("{} + {} chars".format(pos, countvar.get()))
                widget.tag_add(tag,pos,pos_end)
                pos = pos_end
            else:
                break

We can use this function like this:

root = Tk()
notepad = Text(root)
notepad.pack()
notepad.tag_configure("misspelt_word_tag", background="pink")

notepad.insert("end", "ll ll hello")
misspelt_word_list = ['ll']
highlight_words(notepad, "misspelt_word_tag", misspelt_word_list)

root.mainloop()
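Since the original goal was replacing rather than highlighting, the same search loop can be adapted to delete each match and insert the correction. This is a sketch under assumptions, not a drop-in for your program: replace_whole_word is a hypothetical name, and the word is assumed to contain no regular expression metacharacters.

```python
import tkinter as tk


def replace_whole_word(widget, old, new):
    """Replace each whole-word match of `old` in a Text widget with `new`.

    Returns the number of replacements made.
    """
    count = tk.IntVar(master=widget)
    pattern = r"\m{}\M".format(old)  # assumes `old` has no metacharacters
    replaced = 0
    pos = "1.0"
    while True:
        pos = widget.search(pattern, pos, "end", count=count, regexp=True)
        if not pos:
            break
        match_end = "{} + {} chars".format(pos, count.get())
        widget.delete(pos, match_end)
        widget.insert(pos, new)
        # Resume after the inserted text so the replacement is not rescanned.
        pos = widget.index("{} + {} chars".format(pos, len(new)))
        replaced += 1
    return replaced
```

Skipping past the inserted text matters here: an apostrophe is not a word character, so the "ll" in "I'll" would itself match \mll\M if it were searched again. For the same reason, text that already contains "I'll" would be affected, so a real spellchecker may need a stricter pattern.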

For an overview of regular expressions, see the documentation for the re module.

The regular expressions used in the text widget search method are slightly different from Python regular expressions. For example, Python uses \b to mean a word boundary (either the beginning or the end of a word), whereas the search method uses \m and \M. For a detailed explanation of the expression syntax used by the search method, see Tcl's re_syntax man page.
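For comparison, here is roughly what the same whole-word idea looks like with Python's re module and \b, operating on a plain string instead of the widget. The function name is hypothetical, and note that writing the result back into a text widget wholesale would discard existing tags, which is why the widget's own search is used above.

```python
import re


def replace_whole_word_in_string(text, old, new):
    """Replace whole-word occurrences of `old` in `text` with `new`."""
    pattern = r"\b" + re.escape(old) + r"\b"
    # A function replacement keeps `new` literal (no backslash processing).
    return re.sub(pattern, lambda match: new, text)
```

With the examples from the question, "ll ll hello" becomes "I'll I'll hello" and 'today' is left alone when 'ay' is replaced with 'are'.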