tensorflow sparse-matrix categorical-data

Can TensorFlow handle categorical features with multiple inputs within one column?


For example, I have data in the following CSV format:

1, 2, 1:3:4, 2

0, 1, 3:5, 1

...

Each comma-separated column represents one feature. Normally a feature is one-hot (e.g. col0, col1, col3), but in this case col2 holds multiple inputs (separated by colons).

I'm sure TensorFlow can handle a one-hot feature with a sparse tensor, but I'm not sure whether it can handle a feature with multiple inputs like col2.

And if so, how should it be represented in TensorFlow's sparse tensor?


Solution

  • TensorFlow has some string processing ops which can handle lists within CSVs. I'd read the list as a string column first, then process it like this:

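    import tensorflow as tf  # this answer uses the TensorFlow 1.x API

    def process_list_column(list_column, dtype=tf.float32):
      # Split each string on ":" and convert the pieces to numbers,
      # keeping the sparse (row, position) layout of the split.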
      sparse_strings = tf.string_split(list_column, delimiter=":")
      return tf.SparseTensor(indices=sparse_strings.indices,
                             values=tf.string_to_number(sparse_strings.values,
                                                        out_type=dtype),
                             dense_shape=sparse_strings.dense_shape)
    

    An example of using this function:

    # csv_input.csv contains:
    # 1,2,1:3:4,2
    # 0,1,3:5,1
    filename_queue = tf.train.string_input_producer(["csv_input.csv"])
    # Read two lines, batched
    _, lines = tf.TextLineReader().read_up_to(filename_queue, 2)
    columns = tf.decode_csv(lines, record_defaults=[[0], [0], [""], [0]])
    columns[2] = process_list_column(columns[2], dtype=tf.int32)
    
    with tf.Session() as session:
      coordinator = tf.train.Coordinator()
      tf.train.start_queue_runners(session, coord=coordinator)
    
      print(session.run(columns))
    
      coordinator.request_stop()
      coordinator.join()
    

    Outputs:

    [array([1, 0], dtype=int32), 
     array([2, 1], dtype=int32), 
     SparseTensorValue(indices=array([[0, 0],
           [0, 1],
           [0, 2],
           [1, 0],
           [1, 1]]), 
         values=array([1, 3, 4, 3, 5], dtype=int32), 
         dense_shape=array([2, 3])),
     array([2, 1], dtype=int32)]
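
To make the layout of that SparseTensorValue concrete, here is a minimal pure-Python sketch (no TensorFlow needed) that builds the same indices / values / dense_shape triple for col2, and then expands it into one multi-hot row per example. The helper names `split_to_coo` and `to_multi_hot` are hypothetical, not TensorFlow functions, and `num_categories=6` is an assumption that the category IDs in col2 run from 0 to 5.

```python
def split_to_coo(rows, delimiter=":"):
    """Split each delimited string; return (indices, values, dense_shape)."""
    indices, values = [], []
    max_len = 0
    for row, cell in enumerate(rows):
        # Drop empty tokens so an empty cell contributes no entries.
        tokens = [t for t in cell.split(delimiter) if t]
        for position, token in enumerate(tokens):
            indices.append([row, position])  # (example, position in its list)
            values.append(int(token))        # the category ID itself
        max_len = max(max_len, len(tokens))
    # dense_shape = [number of examples, longest list in the batch]
    return indices, values, [len(rows), max_len]


def to_multi_hot(indices, values, num_rows, num_categories):
    """Expand the sparse triple into one multi-hot row per example."""
    dense = [[0] * num_categories for _ in range(num_rows)]
    for (row, _position), category in zip(indices, values):
        dense[row][category] = 1
    return dense


# col2 of the two sample rows from the question.
col2 = ["1:3:4", "3:5"]
indices, values, dense_shape = split_to_coo(col2)

# Assumption: category IDs range over 0..5, so each row has 6 slots.
multi_hot = to_multi_hot(indices, values, dense_shape[0], num_categories=6)

print(indices)      # [[0, 0], [0, 1], [0, 2], [1, 0], [1, 1]]
print(values)       # [1, 3, 4, 3, 5]
print(dense_shape)  # [2, 3]
print(multi_hot)    # [[0, 1, 0, 1, 1, 0], [0, 0, 0, 1, 0, 1]]
```

The second index in each pair is only the position within the colon-separated list, not the category; the category IDs live in `values`. That is why a multi-valued column fits in a sparse tensor without padding every row to the same length.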