I am working on a logistic regression model to predict whether a customer is a business or non-business customer, using Keras in TensorFlow. At the moment I am able to use columns like latitude with the help of tf.feature_column. Now I am working on the NAME1 field. The name often contains repeating parts like "GmbH" (e.g. "Mustermann GmbH"), which in this context has a similar meaning to "Corp." and is an indicator that the customer is a business customer. To separate the different parts of the name and work with them individually, I am tokenizing with the function text_to_word_sequence().
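For illustration, the tokenizer lower-cases and splits on whitespace and punctuation by default, so the legal form becomes its own token:

from tensorflow.keras.preprocessing.text import text_to_word_sequence

print(text_to_word_sequence('Mustermann GmbH'))  # ['mustermann', 'gmbh']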
I import the data into a Pandas DataFrame and afterwards convert this DataFrame to a TensorFlow Dataset with the function from_tensor_slices(), so I can work with the tf.feature_column API.
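Roughly like this (a minimal sketch with a hypothetical two-row frame; RECEIVERTYPE is my label column):

import pandas as pd
import tensorflow as tf

# Hypothetical toy frame standing in for the real data.
df = pd.DataFrame({'LATITUDE': [48.1, 47.2], 'RECEIVERTYPE': [1, 2]})
labels = df.pop('RECEIVERTYPE')
# dict(df) maps each column name to its values, so every row
# becomes a {feature_name: value} dict plus a label.
ds = tf.data.Dataset.from_tensor_slices((dict(df), labels))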
I tried two different strategies for the tokenization. First, I used apply() to create a new tokenized column within the DataFrame:
data['NAME1TOKENIZED'] = data['NAME1'].apply(lambda x: text_to_word_sequence(x))
The new column has the following structure:
0 [palle]
1 [pertl]
2 [graf, robert]
3 [löberbauer, stefanie, asg]
4 [stauber, martin, asg]
...
99995 [truber]
99996 [mesgec]
99997 [mesgec]
99998 [miedl]
99999 [millegger]
Name: NAME1TOKENIZED, Length: 100000, dtype: object
As you can see, the lists have a different number of entries per row, so I have problems converting the DataFrame into a Dataset:
ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type list).
I also tried the tf.ragged.constant() function to create a ragged tensor, which supports lists of varying length.
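A quick sketch of what such a ragged tensor looks like for this data:

import tensorflow as tf

# Rows keep their individual lengths; the second dimension is None.
rt = tf.ragged.constant([['graf', 'robert'], ['palle'], ['stauber', 'martin', 'asg']])
print(rt.shape)  # (3, None)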
Here is my function for converting the DataFrame to a Dataset:
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
    dataframe = dataframe.copy()
    # Pull the tokenized names out of the frame and build a ragged tensor.
    tok_names = dataframe.loc[:, 'NAME1TOKENIZED']
    del dataframe['NAME1TOKENIZED']
    rt_tok_names = tf.ragged.constant(tok_names)
    # RECEIVERTYPE is 1/2 in the raw data; shift it to a 0/1 label.
    labels = dataframe.pop('RECEIVERTYPE')
    labels = labels - 1
    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), rt_tok_names, labels))
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dataframe))
    ds = ds.batch(batch_size)
    return ds
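Called like this (hypothetical usage; data is the full DataFrame from above):

train_ds = df_to_dataset(data, shuffle=True, batch_size=32)
# Each element is a (features_dict, ragged_names, labels) tuple.
features, tok_names, labels = next(iter(train_ds))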
This works pretty well, but as you can imagine I now have a problem on the other side: when I try to use the following function:
name_embedding = tf.feature_column.categorical_column_with_hash_bucket('NAME1TOKENIZED', hash_bucket_size=2500)
I get the following error:

ValueError: Feature NAME1TOKENIZED is not in features dictionary.

I assume this is because the ragged tensor is passed as its own tuple element, so the feature column cannot find it by name in the features dictionary.
I also tried to pass a DataFrame instead of a Series into tf.ragged.constant(), so I could use dict(rt_tok_names) to pass the feature by name, but then I got the following error again:

ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type list).
As a second strategy, I tried to tokenize directly within the Dataset with map():

train_ds.map(lambda x, _: text_to_word_sequence(x['NAME1']))
But I got the following error, presumably because text_to_word_sequence() expects a Python string (it calls .lower() on its input), while map() passes a symbolic Tensor:

AttributeError: 'Tensor' object has no attribute 'lower'
As you can see, I have tried several approaches without success. I would be happy about any recommendations on how to solve this problem.
Thanks!
I found a solution for my problem. I used the Tokenizer to transform the text to sequences, then padded the resulting list of sequences per row to a maximum length of two. Finally, I added these two new columns to the DataFrame. Afterwards I was able to transform the DataFrame into a Dataset and to use the two columns via tf.feature_column.
Here is the relevant code:
import pandas as pd
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer

t = Tokenizer(num_words=name_num_words)  # name_num_words: vocabulary size
t.fit_on_texts(data['NAME1PRO'])
name1_tokenized = t.texts_to_sequences(data['NAME1PRO'])
# Pad/truncate every row to exactly two token ids.
name1_tokenized_pad = tf.keras.preprocessing.sequence.pad_sequences(name1_tokenized, maxlen=2, truncating='pre')
data = pd.concat([data, pd.DataFrame(name1_tokenized_pad, columns=['NAME1W1', 'NAME1W2'])], axis=1)
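The feature-column setup itself is not shown above; one plausible sketch (an assumption on my part, using identity columns over the Tokenizer's id range, where the padding id 0 counts as its own bucket):

name_w1 = tf.feature_column.categorical_column_with_identity('NAME1W1', num_buckets=name_num_words)
name_w2 = tf.feature_column.categorical_column_with_identity('NAME1W2', num_buckets=name_num_words)
feature_columns = [
    tf.feature_column.indicator_column(name_w1),  # one-hot over token ids
    tf.feature_column.indicator_column(name_w2),
]
feature_layer = tf.keras.layers.DenseFeatures(feature_columns)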