
Creating input tensors with the right dimensions from data


I have 4 features: 2 continuous ones of the form

[1,2,3,4,1,1,...]

and 2 categorical ones of the form

[["A"],["A","B"],["A","B","C"], ["A", "C"]]

My labels take the same form as the continuous features, with values from 0 to 8 for multiclass classification. My goal is to predict the label class from the 4 features.

I extract my data from a JSON file into a pandas DataFrame that looks like this:

         f1         f2   f3                f4         label
 0        1          3     [R]              [None]        1
 1        2          2     [U, W]           [Flying]      2
 2        1          4     [None]           [None]        2
 3        1          2     [B]              [Flying]      0
..     ...        ...     ...                 ...   ...

From my understanding of the ragged tensor documentation, I should be able to feed all of these features directly into my model like so:

import tensorflow as tf
f1 = tf.keras.Input(shape=(1,),dtype=tf.dtypes.int32)
f2 = tf.keras.Input(shape=(1,),dtype=tf.dtypes.int32)
f3 = tf.keras.Input(shape=(None,),dtype=tf.dtypes.string, ragged=True)
f4 = tf.keras.Input(shape=(None,),dtype=tf.dtypes.string, ragged=True)

After that I normalize f1 and f2, and use StringLookup, Embedding and Flatten layers for f3 and f4. Then I concatenate the results and feed them into a couple of Dense layers, followed by a final Dense layer with a softmax activation.

My model builds successfully.

However, when I pass my dataframe to my training function like so:

features = ["f1","f2","f3","f4"]
model.fit(x=[training_set[feature] for feature in features],
          y=training_set["label"],
          validation_split=0.1)

I get the following error:

ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type list).

Next I tried manually turning my dataframe into NumPy arrays with the right type:

import numpy as np
f1 = np.asarray(training_set["f1"]).astype(np.int32)
f2 = np.asarray(training_set["f2"]).astype(np.int32)
f3 = np.asarray(training_set["f3"]).astype(object)
f4 = np.asarray(training_set["f4"]).astype(object)
l = np.asarray(training_set["label"]).astype(np.int32)
model.fit(x=[f1,f2,f3,f4],
          y=l,
          validation_split=0.1)   

Which produces the same error:

ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type list).
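
The cause of this error can be seen without a model. The sketch below is an illustration only: it uses toy values from the table above and hypothetical variable names (`rows`, `column`) to mimic what `np.asarray(...).astype(object)` yields for a column of lists. The result is a 1-D array whose elements are Python lists of different lengths, so there is no rectangular shape to map onto a regular tensor.

```python
import numpy as np

# Toy rows standing in for training_set["f3"] (values from the table above).
rows = [["R"], ["U", "W"], [None], ["B"]]

# Fill the object array element by element; this mirrors what NumPy receives
# from a pandas column whose cells are Python lists.
column = np.empty(len(rows), dtype=object)
for i, row in enumerate(rows):
    column[i] = row

# NumPy sees a single axis: each list is an opaque Python object to it.
print(column.shape)
print(column.dtype)
# The inner lists have different lengths, so the data is ragged.
print([len(row) for row in column])
```

Casting to `object` therefore changes only the array's dtype, not the fact that each cell still holds a list.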

It seems to me that, instead of NumPy arrays, I should convert the data to tensors and feed those into the fit method. If I interpret the tf.data documentation correctly, this should be possible?

When trying to do so, I got stuck getting the dimensions right, e.g.:

f1 = tf.stack([tf.convert_to_tensor(i) for i in training_set["f1"].values],axis=0)
# Has shape (622,); as far as I know, I need shape (1,)?
f3 = tf.ragged.stack(data["colors"])
# Has shape (622, None); as far as I know, it needs to be shape (None,)?
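
On the shape comments above: the `shape` argument of `tf.keras.Input` describes a single sample and leaves out the batch axis. Data for `shape=(1,)` is therefore expected as `(num_samples, 1)`, and data for `shape=(None,)` with `ragged=True` as `(num_samples, None)`. A minimal sketch, assuming TensorFlow 2.x is installed, using four toy rows from the table instead of the real 622 and illustrative variable names:

```python
import numpy as np
import tensorflow as tf

# Toy stand-ins for two DataFrame columns (values from the question's table,
# with missing entries already replaced by a placeholder string).
f1_values = [1, 2, 1, 1]
f3_values = [["R"], ["U", "W"], ["[UNK]"], ["B"]]

# One value per sample: add a trailing axis so the shape is (num_samples, 1).
f1_data = np.asarray(f1_values, dtype=np.int32).reshape(-1, 1)

# One variable-length row per sample: a ragged tensor of shape (num_samples, None).
f3_data = tf.ragged.constant(f3_values)

print(f1_data.shape)
print(f3_data.shape)
print(f3_data.row_lengths().numpy().tolist())
```

Under this reading, the `(622, None)` shape in the snippet above already has the right form for a ragged input, while the `(622,)` tensor lacks only the trailing axis.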

What am I doing wrong?


Solution

  • Flattening ragged tensors inside a Keras model is awkward, because the ragged dimension has no fixed length. Instead, you can reduce that dimension with an LSTM layer, GlobalMaxPool1D, GlobalAveragePooling1D, or similar, so that each sample's 2D tensor (sequence length × embedding size) becomes a fixed-length 1D tensor. Here is a working example with two GlobalMaxPool1D layers:

    Data preparation:

    import tensorflow as tf
    import pandas as pd
    
    d = {'f1': [1, 2, 1, 1], 
         'f2': [3, 2, 4, 2], 
         'f3':[['R'] , ['U', 'W'], [None], ['B']], 
         'f4':[[None] , ['Flying'], [None], ['Flying']],
         'label': [1, 2, 2, 0]}
    
    df = pd.DataFrame(data=d)
    # Replace missing values with the '[UNK]' token so every entry is a string
    df['f3'] = df['f3'].apply(lambda x: ['[UNK]' if i is None else i for i in x])
    df['f4'] = df['f4'].apply(lambda x: ['[UNK]' if i is None else i for i in x])
    
    data_f1 = df[["f1"]].to_numpy()
    data_f2 = df[["f2"]].to_numpy()
    labels  = df[["label"]].to_numpy()
    
    data_f3 = tf.ragged.constant([df['f3'].to_list()])
    data_f4 = tf.ragged.constant([df['f4'].to_list()])
    

    Model:

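    # Build one vocabulary over both categorical features
    look_up_layer = tf.keras.layers.StringLookup()
    look_up_layer.adapt(tf.ragged.stack([data_f3, data_f4]))
    # Drop the extra outer dimension added by the enclosing list in the data preparation step
    data_f3 = tf.squeeze(data_f3, axis=0)
    data_f4 = tf.squeeze(data_f4, axis=0)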
    
    f1 = tf.keras.Input(shape=(1,),dtype=tf.dtypes.int32)
    f2 = tf.keras.Input(shape=(1,),dtype=tf.dtypes.int32)
    f3 = tf.keras.Input(shape=(None,),dtype=tf.dtypes.string, ragged=True)
    f4 = tf.keras.Input(shape=(None,),dtype=tf.dtypes.string, ragged=True)
    f1_dense = tf.keras.layers.Dense(5, activation='relu')(f1)
    f2_dense = tf.keras.layers.Dense(5, activation='relu')(f2)
    f3_lookup = look_up_layer(f3)
    f4_lookup = look_up_layer(f4)
    # One embedding table shared by f3 and f4; no input_length, since the rows vary in length
    embedding_layer = tf.keras.layers.Embedding(len(look_up_layer.get_vocabulary()), 10)
    embedd_f3 = embedding_layer(f3_lookup)
    embedd_f4 = embedding_layer(f4_lookup)
    embedd_f3 = tf.keras.layers.GlobalMaxPool1D()(embedd_f3)
    embedd_f4 = tf.keras.layers.GlobalMaxPool1D()(embedd_f4)
    output = tf.keras.layers.Concatenate(axis=-1)([f1_dense, f2_dense, embedd_f3, embedd_f4])
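    # 3 units because this toy data has labels 0 to 2; labels from 0 to 8 would need 9 units
    output = tf.keras.layers.Dense(3, activation='softmax')(output)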
    model = tf.keras.Model([f1, f2, f3, f4], output)
    
    model.compile(optimizer='adam', loss=tf.keras.losses.SparseCategoricalCrossentropy())
    model.fit([data_f1, data_f2, data_f3, data_f4], labels, batch_size=1, epochs=2)
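
To see why the pooling step makes the concatenation possible, here is a plain NumPy illustration of the idea. It is not Keras's implementation: random vectors stand in for the learned embeddings, and the names `row_lengths`, `embedded_rows` and `pooled` are hypothetical. Each sample has a different number of embedded tokens, and taking the element-wise maximum over the token axis leaves one fixed-length vector per sample.

```python
import numpy as np

rng = np.random.default_rng(0)
embedding_dim = 10

# Number of tokens per sample, matching the toy f3 column: ['R'], ['U', 'W'], ...
row_lengths = [1, 2, 1, 1]

# Stand-in embeddings: one (num_tokens, embedding_dim) matrix per sample.
embedded_rows = [rng.normal(size=(n, embedding_dim)) for n in row_lengths]

# Max over the token axis collapses the variable-length dimension,
# so every sample ends up as a vector of length embedding_dim.
pooled = np.stack([row.max(axis=0) for row in embedded_rows])

print(pooled.shape)
```

After this reduction every feature branch has a fixed width, which is what lets the pooled outputs be concatenated with the Dense outputs for f1 and f2.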