I'm supposed to create a word-counting program in Python that determines which words occur in a given text and how often. Certain stop words should be excluded from the count, and so should spaces and special characters (+-??:"; etc.).
The first part of the program is a tokenize function, which will later be run against the following tests:
if hasattr(wordfreq, "tokenize"):
    fun_count = fun_count + 1
    test(wordfreq.tokenize, [], [])
    test(wordfreq.tokenize, [""], [])
    test(wordfreq.tokenize, [" "], [])
    test(wordfreq.tokenize, ["This is a simple sentence"], ["this","is","a","simple","sentence"])
    test(wordfreq.tokenize, ["I told you!"], ["i","told","you","!"])
    test(wordfreq.tokenize, ["The 10 little chicks"], ["the","10","little","chicks"])
    test(wordfreq.tokenize, ["15th anniversary"], ["15","th","anniversary"])
    test(wordfreq.tokenize, ["He is in the room, she said."], ["he","is","in","the","room",",","she","said","."])
else:
    print("tokenize is not implemented yet!")
But my function only passes 7 of the 8 tests.
The output after running them is:
Condition failed:
tokenize([' ']) == []
tokenize returned/printed:
['']
countWords is not implemented yet!
printTopMost is not implemented yet!
7 out of 8 passed.
I suspect the problem is in my else branch, perhaps in how I have used end = start or something similar.
Could anyone tell me what I should change, and explain the difference between the correct solution and mine?
My code:
def tokenize(lines):
    words = []
    for line in lines:
        start = 0
        while start < len(line):
            while start < len(line) and line[start].isspace():
                start = start + 1
            end = start
            if end < len(line) and line[end].isdigit():
                end = start
                while end < len(line) and line[end].isdigit():
                    end = end + 1
                words.append(line[start:end])
                start = end
            elif end < len(line) and line[end].isalpha():
                end = start
                while end < len(line) and line[end].isalpha():
                    end = end + 1
                words.append(line[start:end].lower())
                start = end
            else:
                end = start
                end < len(line)
                end = end + 1
                words.append(line[start:end])
                start = end
    return words
Everything looks right to me except the last else, where I think you missed an if condition. I also added a line.strip() at the start, before any of the other logic.
The test case tokenize([" "]) == [] fails because, if you don't strip lines that contain only whitespace, the result will be [''], and [] is not equal to [''].
def tokenize(lines):
    words = []
    for line in lines:
        line = line.strip()
        start = 0
        while start < len(line):
            while start < len(line) and line[start].isspace():
                start = start + 1
            end = start
            if end < len(line) and line[end].isdigit():
                end = start
                while end < len(line) and line[end].isdigit():
                    end = end + 1
                words.append(line[start:end])
                start = end
            elif end < len(line) and line[end].isalpha():
                end = start
                while end < len(line) and line[end].isalpha():
                    end = end + 1
                words.append(line[start:end].lower())
                start = end
            else:
                end = start
                if end < len(line):
                    end = end + 1
                words.append(line[start:end])
                start = end
    return words
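To see where the stray '' in your original version comes from: when the whitespace-skipping loop advances start all the way to len(line), the final else still runs, and the unguarded end = end + 1 pushes end past the end of the string. Slicing a Python string at or beyond its end quietly yields an empty string rather than raising an error, so '' gets appended:

```python
line = " "
start = 1        # after the whitespace-skip loop, start == len(line)
end = start + 1  # the unguarded end = end + 1 in the original else branch

# Out-of-range slices are legal in Python and produce '',
# which is exactly the stray token the test caught.
print(repr(line[start:end]))  # ''
```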
If you don't want to use line.strip(), another way to achieve the same result is to add an extra if condition before appending to words, as shown below:
def tokenize(lines):
    words = []
    for line in lines:
        start = 0
        while start < len(line):
            while start < len(line) and line[start].isspace():
                start = start + 1
            end = start
            if end < len(line) and line[end].isdigit():
                end = start
                while end < len(line) and line[end].isdigit():
                    end = end + 1
            elif end < len(line) and line[end].isalpha():
                end = start
                while end < len(line) and line[end].isalpha():
                    end = end + 1
            else:
                end = start
                if end < len(line):
                    end = end + 1
            if start != end:
                words.append(line[start:end].lower())
            start = end
    return words
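As an aside, and assuming your course allows the re module, the same tokenization rule (a token is a run of letters, a run of digits, or a single non-space character) can be written much more compactly with a regular expression. This is just a sketch for comparison, not a replacement for the scanning approach above:

```python
import re

def tokenize(lines):
    # A token is a run of letters, a run of digits,
    # or any single non-whitespace character (punctuation etc.).
    words = []
    for line in lines:
        words.extend(t.lower() for t in re.findall(r"[A-Za-z]+|[0-9]+|\S", line))
    return words

print(tokenize(["He is in the room, she said."]))
# ['he', 'is', 'in', 'the', 'room', ',', 'she', 'said', '.']
```

One difference to be aware of: [A-Za-z] only matches ASCII letters, whereas str.isalpha() also accepts accented and other Unicode letters.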