python pandas dataframe tensorflow feature-engineering

Pandas Dataframe, TensorFlow Dataset: Where to do the TensorFlow Tokenization step?


I am working on a logistic regression model with Keras in TensorFlow to predict whether a customer is a business or a non-business customer. At the moment I can use columns such as latitude with the help of tf.feature_column. Now I am working on the NAME1 field. The name often contains recurring parts such as “GmbH” (e.g. “Mustermann GmbH”), which in this context means roughly the same as “Corp.” and indicates that the customer is a business customer. To separate the parts of the name and work with them individually, I tokenize the name with the function text_to_word_sequence().

I import the data into a pandas DataFrame and then convert this DataFrame to a TensorFlow Dataset with from_tensor_slices() so that I can work with tf.feature_column. I tried two different strategies for the tokenization:

  1. Tokenization before converting the pandas DataFrame to a TensorFlow Dataset

After importing the data I used the pandas method apply() to create a new tokenized column in the DataFrame:

    data['NAME1TOKENIZED'] = data['NAME1'].apply(lambda x: text_to_word_sequence(x))

The new column has the following structure:

    0                            [palle]
    1                            [pertl]
    2                     [graf, robert]
    3        [löberbauer, stefanie, asg]
    4             [stauber, martin, asg]
                        ...             
    99995                       [truber]
    99996                       [mesgec]
    99997                       [mesgec]
    99998                        [miedl]
    99999                    [millegger]
    Name: NAME1TOKENIZED, Length: 100000, dtype: object

As you can see, the lists have different numbers of entries, so converting the DataFrame into a Dataset fails: ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type list). I also tried tf.ragged.constant() to create a ragged tensor, which allows lists of varying length. Here is my function for converting the DataFrame to a Dataset:

def df_to_dataset(dataframe, shuffle=True, batch_size=32):
    dataframe = dataframe.copy()
    tok_names = dataframe.loc[:,'NAME1TOKENIZED']
    del dataframe['NAME1TOKENIZED']
    rt_tok_names = tf.ragged.constant(tok_names)
    labels = dataframe.pop('RECEIVERTYPE')
    labels = labels - 1
    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), rt_tok_names, labels))
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dataframe))
    ds = ds.batch(batch_size)
    return ds

This works pretty well, but as you can imagine, I now have a problem on the other side. When I try to define the following feature column:

name_embedding = tf.feature_column.categorical_column_with_hash_bucket('NAME1TOKENIZED', hash_bucket_size=2500)

I get the following error:

ValueError: Feature NAME1TOKENIZED is not in features dictionary.

I also tried passing a DataFrame instead of a Series to tf.ragged.constant() so that I could use dict(rt_tok_names) to pass the column label along with the values, but then I get the earlier error again: ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type list).
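The "not in features dictionary" error is consistent with how df_to_dataset builds its elements: the ragged tensor is passed as a separate tuple component, so the features dict never receives a NAME1TOKENIZED key. The sketch below uses hypothetical toy values (the LATITUDE key and the two sample rows are assumptions, not the real data) to compare that layout with one where the ragged tensor is stored in the dict itself.

```python
import tensorflow as tf

# Hypothetical toy rows standing in for the real DataFrame columns.
features = {"LATITUDE": [48.2, 47.1]}
tok_names = tf.ragged.constant([["graf", "robert"], ["palle"]])
labels = [0, 1]

# Layout used in df_to_dataset: tokens are a separate tuple component.
ds_tuple = tf.data.Dataset.from_tensor_slices((features, tok_names, labels))

# Alternative layout: tokens live in the features dict under the expected key.
features_with_tokens = dict(features, NAME1TOKENIZED=tok_names)
ds_dict = tf.data.Dataset.from_tensor_slices((features_with_tokens, labels))

print(list(ds_tuple.element_spec[0].keys()))  # feature keys in the first layout
print(list(ds_dict.element_spec[0].keys()))   # feature keys in the second layout
```

This only addresses the key lookup. Feature columns generally expect dense or sparse tensors, so the ragged values may still need to be padded or converted before the column can consume them.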

  2. Tokenization after converting the pandas DataFrame to a TensorFlow Dataset

I tried, for example, the following:

train_ds.map(lambda x, _: text_to_word_sequence(x['NAME1']))

But I got the following error: AttributeError: 'Tensor' object has no attribute 'lower'
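The AttributeError fits the fact that text_to_word_sequence() is a plain Python function that expects a str, while Dataset.map() hands the mapped function symbolic tensors. One possible workaround, sketched here under assumptions, is to tokenize with TensorFlow's own string ops, which accept tensors. The helper names tokenize_names and add_tokens and the toy data are hypothetical, and the punctuation handling only approximates the filters of text_to_word_sequence().

```python
import tensorflow as tf

def tokenize_names(names):
    """Lowercase, replace ASCII punctuation with spaces, split on whitespace."""
    names = tf.strings.lower(names, encoding="utf-8")
    names = tf.strings.regex_replace(names, r"[[:punct:]]", " ")
    return tf.strings.split(names)  # RaggedTensor for a batch of strings

def add_tokens(x, y):
    """Add a NAME1TOKENIZED entry to the features dict of a batched element."""
    x = dict(x)
    x["NAME1TOKENIZED"] = tokenize_names(x["NAME1"])
    return x, y

# Hypothetical toy data standing in for the NAME1 column and the labels.
features = {"NAME1": ["Mustermann GmbH", "Graf, Robert"]}
labels = [1, 0]
ds = tf.data.Dataset.from_tensor_slices((features, labels)).batch(2)
ds = ds.map(add_tokens)

tokens = next(iter(ds))[0]["NAME1TOKENIZED"].to_list()
print(tokens)
```

The tokens come back as byte strings in a ragged tensor, so the variable-length problem from the first strategy still has to be handled, for example by padding.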

As you can see, I tried several approaches without success. I would be grateful for any recommendations on how to solve my problem.

Thanks!


Solution

  • I found a solution for my problem. I used the Tokenizer to transform the text into sequences and then padded the resulting sequence for each row to a fixed length of two. Finally, I added these two new columns to the DataFrame. Afterwards I was able to convert the DataFrame to a Dataset and use these two columns with the help of tf.feature_column. Here is the relevant code:

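    import pandas as pd
    import tensorflow as tf
    from tensorflow.keras.preprocessing.text import Tokenizer

    # Build the vocabulary and turn every name into a list of word indices
    t = Tokenizer(num_words=name_num_words)
    t.fit_on_texts(data['NAME1PRO'])
    name1_tokenized = t.texts_to_sequences(data['NAME1PRO'])

    # Pad or truncate every sequence to exactly two indices (truncating='pre' keeps the last two words)
    name1_tokenized_pad = tf.keras.preprocessing.sequence.pad_sequences(name1_tokenized, maxlen=2, truncating='pre')

    # Append the two index columns to the DataFrame
    data = pd.concat([data, pd.DataFrame(name1_tokenized_pad, columns=['NAME1W1', 'NAME1W2'])], axis=1)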
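To make the effect of the Tokenizer and pad_sequences steps easier to follow, here is a plain-Python sketch of roughly what they compute. It is not the Keras implementation: fit_word_index and encode_and_pad are hypothetical helpers, the sample names are made up, and punctuation filtering and the num_words cap are left out. It assumes the documented defaults, where words are indexed by descending frequency starting at 1 and both padding and truncating happen at the front.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # sorted() is stable, so words with equal counts keep first-seen order.
    ranked = sorted(counts, key=lambda w: -counts[w])
    return {word: i for i, word in enumerate(ranked, start=1)}

def encode_and_pad(texts, word_index, maxlen=2):
    """Map words to indices, keep the last `maxlen`, left-pad with zeros."""
    rows = []
    for text in texts:
        ids = [word_index[w] for w in text.lower().split() if w in word_index]
        ids = ids[-maxlen:]                           # like truncating='pre'
        rows.append([0] * (maxlen - len(ids)) + ids)  # like padding='pre'
    return rows

# Hypothetical sample names standing in for the NAME1PRO column.
names = ["Mustermann GmbH", "Graf Robert", "Huber Bau GmbH", "Palle"]
word_index = fit_word_index(names)
padded = encode_and_pad(names, word_index)
print(padded)  # [[2, 1], [3, 4], [6, 1], [0, 7]]
```

Because truncation removes leading words, a trailing legal form such as "GmbH" survives in the second column even for longer names, which is the signal the question is after. Every row now has exactly two integers, so the result fits into ordinary DataFrame columns.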