Tags: python, machine-learning, tensorflow, logistic-regression

How to convert numerically coded categorical data into sparse tensors in TensorFlow?


My dataset format is as shown below:

8,2,1,1,1,0,3,2,6,2,2,2,2
8,2,1,2,0,0,15,2,1,2,2,2,1
5,5,4,4,0,0,6,1,6,2,2,1,2
8,2,1,3,0,0,2,2,6,2,2,2,2
8,2,1,2,0,0,3,2,1,2,2,2,1
8,2,1,4,0,1,3,2,1,2,2,2,1
8,2,1,2,0,0,3,2,1,2,2,2,1
8,2,1,3,0,0,2,2,6,2,2,2,2
8,2,1,12,0,0,5,2,2,2,2,2,1
3,1,1,2,0,0,3,2,1,2,2,2,1

It consists entirely of categorical data, with each feature coded numerically. I tried the following code:

        monthly_income = tf.contrib.layers.sparse_column_with_keys(
            "monthly_income", keys=['1', '2', '3', '4', '5', '6'])
        # Other columns are also declared in the same way

        m = tf.contrib.learn.LinearClassifier(
            feature_columns=[
                caste, religion, differently_abled, nature_of_activity, school,
                dropout, qualification, computer_literate, monthly_income,
                smoke, drink, tobacco, sex],
            model_dir=model_dir)

But I am getting the following error:

TypeError: Signature mismatch. Keys must be dtype <dtype: 'string'>, got <dtype: 'int64'>.

Solution

  • I think the problem lies outside the code you have shown. My guess is that the features in the CSV file were read as ints, whereas passing keys=['1', '2', ...] tells the column to expect strings.

    Either way, in this situation I recommend using sparse_column_with_integerized_feature, which uses the integer feature values directly as category IDs:

    monthly_income = tf.contrib.layers.sparse_column_with_integerized_feature("monthly_income", bucket_size=7)
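
The bucket_size of 7 follows from the requirement that every feature value fall in the range [0, bucket_size): monthly_income is coded 1 to 6, so the smallest valid size is 6 + 1. A minimal sketch of how the same value could be derived for each column is shown here in plain Python, without TensorFlow. It assumes the CSV columns appear in the same order as the feature_columns list in the question, and it uses only the ten sample rows, so the full dataset may need larger sizes. The helper name suggest_bucket_sizes is hypothetical.

```python
import csv
import io

# Sample rows copied from the question; the full dataset may hold larger values.
SAMPLE_CSV = """\
8,2,1,1,1,0,3,2,6,2,2,2,2
8,2,1,2,0,0,15,2,1,2,2,2,1
5,5,4,4,0,0,6,1,6,2,2,1,2
8,2,1,3,0,0,2,2,6,2,2,2,2
8,2,1,2,0,0,3,2,1,2,2,2,1
8,2,1,4,0,1,3,2,1,2,2,2,1
8,2,1,2,0,0,3,2,1,2,2,2,1
8,2,1,3,0,0,2,2,6,2,2,2,2
8,2,1,12,0,0,5,2,2,2,2,2,1
3,1,1,2,0,0,3,2,1,2,2,2,1
"""

# Assumption: the CSV column order matches feature_columns in the question.
COLUMN_NAMES = [
    "caste", "religion", "differently_abled", "nature_of_activity", "school",
    "dropout", "qualification", "computer_literate", "monthly_income",
    "smoke", "drink", "tobacco", "sex",
]


def suggest_bucket_sizes(csv_text, column_names):
    """Return {column name: largest observed value + 1} for integer-coded columns."""
    # Parse every non-empty row into a list of ints.
    rows = [[int(value) for value in row]
            for row in csv.reader(io.StringIO(csv_text)) if row]
    # Integerized features must be non-negative to fit in [0, bucket_size).
    if any(value < 0 for row in rows for value in row):
        raise ValueError("integerized features must be non-negative")
    # zip(*rows) transposes the rows into columns.
    maxima = [max(column) for column in zip(*rows)]
    return {name: largest + 1 for name, largest in zip(column_names, maxima)}


bucket_sizes = suggest_bucket_sizes(SAMPLE_CSV, COLUMN_NAMES)
print(bucket_sizes["monthly_income"])  # 7, matching bucket_size=7 above
```

Each resulting value could then be passed as bucket_size when declaring the corresponding column. If the codes have large gaps, the unused IDs still take up slots, so a size chosen this way is a lower bound for the observed data, not a tuned setting.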