I was trying to write code to tokenize strings in Python for some NLP and came up with this:
str = ['I am Batman.', 'I loved the tea.', 'I will never go to that mall again!']
s = []
a = 0
for line in str:
    s.append([])
    s[a].append(line.split())
    a += 1
print(s)
the output came out to be:
[[['I', 'am', 'Batman.']], [['I', 'loved', 'the', 'tea.']], [['I', 'will', 'never', 'go', 'to', 'that', 'mall', 'again!']]]
As you can see, the list now has an extra dimension: for example, if I want the word 'Batman', I have to type s[0][0][2] instead of s[0][2]. So I changed the code to:
str = ['I am Batman.', 'I loved the tea.', 'I will never go to that mall again!']
s = []
a = 0
m = []
for line in str:
    s.append([])
    m = line.split()
    for word in m:
        s[a].append(word)
    a += 1
print(s)
which got me the correct output:
[['I', 'am', 'Batman.'], ['I', 'loved', 'the', 'tea.'], ['I', 'will', 'never', 'go', 'to', 'that', 'mall', 'again!']]
But I have a feeling this could work with a single loop, because the dataset I will be importing will be pretty large, and a complexity of n would be a lot better than n^2. So, is there a better way to do this / a way to do it with one loop?
Your original code is so nearly there.
>>> str = ['I am Batman.', 'I loved the tea.', 'I will never go to that mall again!']
>>> s = []
>>> for line in str:
...     s.append(line.split())
...
>>> print(s)
[['I', 'am', 'Batman.'], ['I', 'loved', 'the', 'tea.'], ['I', 'will', 'never', 'go', 'to', 'that', 'mall', 'again!']]
line.split() already gives you a list, so just append that directly in your loop.
Or go straight for a comprehension:
[line.split() for line in str]
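As a minimal runnable sketch of the comprehension approach (using the name sentences instead of str, since str shadows the built-in type name):

```python
# split() returns a list of tokens, so collecting one list per line
# gives the desired 2-D structure in a single pass.
sentences = ['I am Batman.', 'I loved the tea.', 'I will never go to that mall again!']
tokens = [line.split() for line in sentences]

print(tokens[0])     # ['I', 'am', 'Batman.']
print(tokens[0][2])  # Batman.  -- no extra index needed
```

This is still a single pass over every word overall; the nested access s[0][2] now works as intended.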
When you say s.append([]), you have an empty list at index a, like this:
L = []
If you then append the result of the split to that, like L.append([1]), you end up with a list inside that list: [[1]].
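To see the difference concretely, here is a small sketch contrasting append, which nests the whole list as one element, with extend, which adds the list's elements individually:

```python
L = []
L.append([1, 2])   # inserts the list itself as a single element
print(L)           # [[1, 2]]

M = []
M.extend([1, 2])   # adds each element of the list
print(M)           # [1, 2]
```

That extra level of nesting from append is exactly where the unwanted dimension in the first version came from.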