python, python-3.x, nlp, tokenize

Is there a better way to tokenize some strings?


I was trying to write some code to tokenize strings in Python for some NLP and came up with this:

str = ['I am Batman.','I loved the tea.','I will never go to that mall again!']
s = []
a = 0
for line in str:
    s.append([])
    s[a].append(line.split())
    a += 1
print(s)

The output came out to be:

[[['I', 'am', 'Batman.']], [['I', 'loved', 'the', 'tea.']], [['I', 'will', 'never', 'go', 'to', 'that', 'mall', 'again!']]]

As you can see, the list now has an extra dimension; for example, if I want the word 'Batman', I have to type s[0][0][2] instead of s[0][2]. So I changed the code to:

str = ['I am Batman.','I loved the tea.','I will never go to that mall again!']
s = []
a = 0
m = []
for line in str:
    s.append([])
    m = line.split()
    for word in m:
        s[a].append(word)
    a += 1
print(s)

which got me the correct output:

[['I', 'am', 'Batman.'], ['I', 'loved', 'the', 'tea.'], ['I', 'will', 'never', 'go', 'to', 'that', 'mall', 'again!']]

But I have a feeling this could work with a single loop, because the dataset I will be importing is pretty large, and a complexity of n would be a lot better than n^2. So, is there a better way to do this, or a way to do it with one loop?


Solution

  • Your original code is so nearly there.

    >>> str = ['I am Batman.','I loved the tea.','I will never go to that mall again!']
    >>> s = []
    >>> for line in str:
    ...   s.append(line.split())
    ...
    >>> print(s)
    [['I', 'am', 'Batman.'], ['I', 'loved', 'the', 'tea.'], ['I', 'will', 'never', 'go', 'to', 'that', 'mall', 'again!']]
    

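    As an aside, str here shadows the built-in type name, so a different variable name is safer. A minimal sketch of the same loop (the name sentences is purely illustrative):

    >>> sentences = ['I am Batman.','I loved the tea.','I will never go to that mall again!']
    >>> s = []
    >>> for line in sentences:
    ...   s.append(line.split())
    ...
    >>> s[0][2]
    'Batman.'
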
    line.split() gives you a list, so append that directly in your loop. Or go straight for a list comprehension:

    [line.split() for line in str]
    

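    Run on the same input, the comprehension gives the identical flat-per-sentence result in a single pass:

    >>> [line.split() for line in str]
    [['I', 'am', 'Batman.'], ['I', 'loved', 'the', 'tea.'], ['I', 'will', 'never', 'go', 'to', 'that', 'mall', 'again!']]
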
    When you say s.append([]), you have an empty list at index 'a', like this:

    L = []
    

    If you append the result of the split to that, like L.append([1]), you end up with a list inside that list: [[1]]
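
    A quick interpreter session makes that nesting concrete, and shows why the extra index was needed:

    >>> L = []
    >>> L.append([1])
    >>> L
    [[1]]
    >>> L[0][0]
    1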