I was trying to write code to tokenize strings in Python for some NLP and came up with this:
str = ['I am Batman.', 'I loved the tea.', 'I will never go to that mall again!']
s = []
a = 0
for line in str:
    s.append([])
    s[a].append(line.split())
    a += 1
print(s)
the output came out to be:
[[['I', 'am', 'Batman.']], [['I', 'loved', 'the', 'tea.']], [['I', 'will', 'never', 'go', 'to', 'that', 'mall', 'again!']]]
As you can see, the list now has an extra dimension: for example, if I want the word 'Batman', I have to type s[0][0][2] instead of s[0][2]. So I changed the code to:
str = ['I am Batman.', 'I loved the tea.', 'I will never go to that mall again!']
s = []
a = 0
m = []
for line in str:
    s.append([])
    m = line.split()
    for word in m:
        s[a].append(word)
    a += 1
print(s)
which got me the correct output:
[['I', 'am', 'Batman.'], ['I', 'loved', 'the', 'tea.'], ['I', 'will', 'never', 'go', 'to', 'that', 'mall', 'again!']]
But I have a feeling this could work with a single loop, because the dataset I will be importing will be pretty large, and a complexity of n would be a lot better than n^2. So, is there a better way to do this / a way to do it with one loop?
Your original code is so nearly there.
>>> str = ['I am Batman.', 'I loved the tea.', 'I will never go to that mall again!']
>>> s = []
>>> for line in str:
...     s.append(line.split())
...
>>> print(s)
[['I', 'am', 'Batman.'], ['I', 'loved', 'the', 'tea.'], ['I', 'will', 'never', 'go', 'to', 'that', 'mall', 'again!']]
line.split() already gives you a list, so just append that directly in your loop.
Or go straight for a comprehension:
[line.split() for line in str]
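As a minimal runnable sketch of the comprehension approach (using the name sentences instead of str, since str shadows the built-in type name):

```python
# split() returns a list of tokens, so collecting one list per line
# gives the desired 2-D structure in a single pass.
sentences = ['I am Batman.', 'I loved the tea.', 'I will never go to that mall again!']
tokens = [line.split() for line in sentences]

print(tokens[0])     # ['I', 'am', 'Batman.']
print(tokens[0][2])  # Batman.  -- no extra index needed
```

This is still a single pass over every word overall; the nested access s[0][2] now works as intended.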
When you say s.append([]), you have an empty list at index a, like this:
L = []
If you then append the result of the split to that, like L.append([1]), you end up with a list inside that list: [[1]].
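To see the difference concretely, here is a small sketch contrasting append, which nests the whole list as one element, with extend, which adds the list's elements individually:

```python
L = []
L.append([1, 2])   # inserts the list itself as a single element
print(L)           # [[1, 2]]

M = []
M.extend([1, 2])   # adds each element of the list
print(M)           # [1, 2]
```

That extra level of nesting from append is exactly where the unwanted dimension in the first version came from.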