Tags: python, tensorflow, keras, nlp, token

How to Tokenize a list of lists of lists of strings


I have a text dataset that is a list of lists of lists of strings. I need to tokenize this data to fit it into a classification model. I am very familiar with keras.preprocessing.text.Tokenizer, and would normally do this with the code below:

data = [[['not'],
         ['ahead'],
         ['um let me think'],
         ['thats not very encouraging if they had a cast of thousands on the other end']],
        [['okay civil liberties tell me your position'],
         ['probably would go ahead']],
        [['oh'],
         ['it up so i dont know where you really go'],
         ['well most of my problem with this latest task'],
         ['its some i kind of dont want to put in the time to do it'],
         ['right so im saying ive got a lot of other things to do']]]

from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(data)
sequences = tokenizer.texts_to_sequences(data)

When I run this code on my data, I get the following error:

2 frames
<ipython-input-44-1da804f42cc8> in main()
     12     # tokenize and vectorize text data to prepare for embedding
     13     tokenizer = Tokenizer()
---> 14     tokenizer.fit_on_texts(new_corpus)
     15     sequences = tokenizer.texts_to_sequences(new_corpus)
     16     word_index = tokenizer.word_index

/usr/local/lib/python3.6/dist-packages/keras_preprocessing/text.py in fit_on_texts(self, texts)
    213                 if self.lower:
    214                     if isinstance(text, list):
--> 215                         text = [text_elem.lower() for text_elem in text]
    216                     else:
    217                         text = text.lower()

/usr/local/lib/python3.6/dist-packages/keras_preprocessing/text.py in <listcomp>(.0)
    213                 if self.lower:
    214                     if isinstance(text, list):
--> 215                         text = [text_elem.lower() for text_elem in text]
    216                     else:
    217                         text = text.lower()

AttributeError: 'list' object has no attribute 'lower'

This makes sense to me: fit_on_texts expects each element of its input to be a string, but here each element is itself a list. Normally, I would flatten my list structure before passing it through the Tokenizer.

However, I cannot do this because my nested list structure is essential for my modeling.

How, then, can I tokenize my data while preserving the list structure? I want to treat the whole thing as one corpus and assign unique integer tokens to words across all lists.

It should look something like this (tokenization done by hand here, so pardon any typos):

data = [[[0],
         [1],
         [2, 3, 4, 5],
         [6, 0, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]],
        [[20, 21, 22, 23, 3, 24, 25],
         [26, 27, 28, 29]],
        [[30],
         [31, 32, 33, 34, 35, 36, 37, 38, 39, 40],
         [41, 42, 43, 44, 45, 46, 47, 48, 49],
         [50, 51, 34, 52, 14, 35, 53, 54, 55, 56, 17, 57, 58, 59, 31],
         [60, 61, 62, 63, 64, 65, 12, 66, 14, 67, 68, 59, 31]]]
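Put another way, the goal is a single word-to-integer index built over every string in the corpus, applied while keeping the nesting intact. A plain-Python sketch of that mapping (the helper names build_index and encode are illustrative; ids here are assigned in order of first appearance and start at 0, whereas Keras's Tokenizer starts at 1 and orders by frequency):

```python
def build_index(corpus):
    """Assign a unique integer to every word across the whole nested corpus."""
    index = {}
    for group in corpus:
        for (sentence,) in group:          # each innermost list holds one string
            for word in sentence.split():
                index.setdefault(word, len(index))
    return index

def encode(corpus, index):
    """Replace each sentence with its list of word ids, keeping the nesting."""
    return [[[index[w] for w in sentence.split()] for (sentence,) in group]
            for group in corpus]
```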

Solution

  • You can preserve the structure and the indexing by fitting the tokenizer on a flattened copy of the corpus, then encoding each inner sentence separately while rebuilding the nesting:

    from keras.preprocessing.text import Tokenizer

    # flatten: each innermost list holds a single string
    tok_data = [y[0] for x in data for y in x]

    # build one word index over the whole corpus
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(tok_data)

    # encode sentence by sentence, keeping the original nesting
    sequences = []
    for x in data:
        tmp = []
        for y in x:
            tmp.append(tokenizer.texts_to_sequences(y)[0])
        sequences.append(tmp)
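Note that the flatten-fit-renest pattern above does not depend on Keras specifically; it works with anything exposing fit_on_texts and texts_to_sequences. A minimal sketch, using a stand-in tokenizer class purely so the example runs without Keras installed (unlike the real Tokenizer, the stand-in does no punctuation filtering and indexes words in order of first appearance rather than by frequency):

```python
class TinyTokenizer:
    """Stand-in with the two methods the pattern relies on."""
    def __init__(self):
        self.word_index = {}

    def fit_on_texts(self, texts):
        for text in texts:
            for word in text.lower().split():
                # like Keras's Tokenizer, start indices at 1
                self.word_index.setdefault(word, len(self.word_index) + 1)

    def texts_to_sequences(self, texts):
        return [[self.word_index[w] for w in t.lower().split()
                 if w in self.word_index]
                for t in texts]

def tokenize_nested(corpus, tokenizer):
    """Fit on the flattened corpus, then encode each sentence in place."""
    tokenizer.fit_on_texts([y[0] for x in corpus for y in x])
    return [[tokenizer.texts_to_sequences(y)[0] for y in x] for x in corpus]
```

Swapping in a real Keras Tokenizer for TinyTokenizer gives the same nested output shape, with indices determined by Keras's own filtering and frequency ordering.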