Tags: python, csv, numpy, tensorflow, recurrent-neural-network

CSV >> Tensorflow >> regression (via neural network) model


TLDR; 1) read and convert CSV data to image, 2) create regression model from data. Note that I was very new to python, deep learning, and Stackoverflow in 2016. Please vote to close this. I think it's too outdated.

Original question below...

Endless Googling has left me better educated on Python and numpy, but still clueless on solving my task. I want to read a CSV of integer/floating point values and predict a value using a neural network. I have found several examples that read the Iris dataset and do classification, but I don't understand how to make them work for regression. Can someone help me connect the dots?

Here is one line of the input:

16804,0,1,0,1,1,0,1,0,1,0,1,0,0,1,1,0,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,1,0,0,1,1,0,0,1,0,1,0,1,0,1,0,1,0,1,0,1,1,0,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,1,0,0,1,0,1,0,1,0,1,1,0,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,1,0,0,1,0,1,0,1,0,1,0,1,0,1,1,0,0,1,0,0,0,1,1,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.490265,0.620805,0.54977,0.869299,0.422268,0.351223,0.33572,0.68308,0.40455,0.47779,0.307628,0.301921,0.318646,0.365993,6135.81

That should be 925 values. The last column is the output. The first is the RowID. Most are binary values because I've already done one-hot encoding. The test files do not have the output/last column. The full training file has around 10M rows. A general MxN solution will do.

Edit: Let's use this sample data since Iris is a classification problem, but note that the above is my real target. I removed the ID column. Let's predict the last column given the 6 other columns. This has 45 rows. (src: http://www.stat.ufl.edu/~winner/data/civwar2.dat)

100,1861,5,2,3,5,38
112,1863,11,7,4,59.82,15.18
113,1862,34,32,1,79.65,2.65
90,1862,5,2,3,68.89,5.56
93,1862,14,10,4,61.29,17.2
179,1862,22,19,3,62.01,8.89
99,1861,22,16,6,67.68,27.27
111,1862,16,11,4,78.38,8.11
107,1863,17,11,5,60.75,5.61
156,1862,32,30,2,60.9,12.82
152,1862,23,21,2,73.55,6.41
72,1863,7,3,3,54.17,20.83
134,1862,22,21,1,67.91,9.7
180,1862,23,16,4,69.44,3.89
143,1863,23,19,4,81.12,8.39
110,1862,16,12,2,31.82,9.09
157,1862,15,10,5,52.23,24.84
101,1863,4,1,3,58.42,18.81
115,1862,14,11,3,86.96,5.22
103,1862,7,6,1,70.87,0
90,1862,11,11,0,70,4.44
105,1862,20,17,3,80,4.76
104,1862,11,9,1,29.81,9.62
102,1862,17,10,7,49.02,6.86
112,1862,19,14,5,26.79,14.29
87,1862,6,3,3,8.05,72.41
92,1862,4,3,0,11.96,86.96
108,1862,12,7,3,16.67,25
86,1864,0,0,0,2.33,11.63
82,1864,4,3,1,81.71,8.54
76,1864,1,0,1,48.68,6.58
79,1864,0,0,0,15.19,21.52
85,1864,1,1,0,89.41,3.53
85,1864,1,1,0,56.47,0
85,1864,0,0,0,31.76,15.29
87,1864,6,5,0,81.61,3.45
85,1864,5,5,0,72.94,0
83,1864,0,0,0,46.99,2.38
101,1864,5,5,0,1.98,95.05
99,1864,6,6,0,42.42,9.09
10,1864,0,0,0,50,9
98,1864,6,6,0,79.59,3.06
10,1864,0,0,0,71,9
78,1864,5,5,0,70.51,1.28
89,1864,4,4,0,59.55,13.48
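For what it's worth, here is how far I understand the parsing step with numpy. This is a minimal sketch assuming the rows above are saved one per line (the file name civwar2.csv is my own placeholder); the last column is the target and the first 6 are the features:

```python
import numpy as np
from io import StringIO

# Two of the rows above, inlined for demonstration; in practice use
# np.loadtxt("civwar2.csv", delimiter=",") on the saved file instead
sample = "100,1861,5,2,3,5,38\n112,1863,11,7,4,59.82,15.18\n"

data = np.loadtxt(StringIO(sample), delimiter=",")
X = data[:, :-1]  # first 6 columns are the features
y = data[:, -1]   # last column is the value to predict

print(X.shape)  # (2, 6)
print(y.shape)  # (2,)
```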

Let me add that this seems like a common task, yet I haven't found it answered on any forum I've read, which is why I'm asking here. I could post my broken code, but I don't want to waste your time with code that isn't functionally correct. Sorry for asking it this way. I just don't understand the APIs, and the documentation doesn't tell me the data types.

Here is the latest code I have that reads the CSV into two ndarrays:

#!/usr/bin/env python
import tensorflow as tf
import numpy as np

# Build example data from CSV (I originally experimented with the Iris
# data from sklearn.datasets, hence the function name)
def buildDataFromIris():
    # Load the whole CSV; column 924 (the last) holds the output value
    data = np.loadtxt("t100.csv.out", delimiter=",", skiprows=0)
    labels = data[:, 924]
    print("labels: ", type(labels), labels.shape, labels.ndim)
    data = np.delete(data, [924], axis=1)
    print("data: ", type(data), data.shape, data.ndim)
    return data, labels
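Given those two ndarrays, a shuffled train/test split can be done directly with numpy. This is a hedged sketch with a helper name of my own (split_data); sklearn's train_test_split does the same job:

```python
import numpy as np

def split_data(data, labels, test_fraction=0.2, seed=0):
    """Shuffle row indices and split features/labels into train and test sets."""
    rng = np.random.RandomState(seed)
    idx = rng.permutation(len(data))
    n_test = int(len(data) * test_fraction)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return data[train_idx], labels[train_idx], data[test_idx], labels[test_idx]

# Toy stand-in for the arrays returned by the CSV-loading code above
data = np.arange(20.0).reshape(10, 2)
labels = np.arange(10.0)
train_X, train_y, test_X, test_y = split_data(data, labels)
print(train_X.shape, test_X.shape)  # (8, 2) (2, 2)
```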

And here is the base code that I want to use. The example it came from wasn't complete either. The APIs in the links below are vague. If I could at least figure out the data types that go into DNNRegressor and the others in the docs, I might be able to write some custom code.

estimator = DNNRegressor(
    feature_columns=[education_emb, occupation_emb],
    hidden_units=[1024, 512, 256])

# Or estimator using the ProximalAdagradOptimizer optimizer with
# regularization.
estimator = DNNRegressor(
    feature_columns=[education_emb, occupation_emb],
    hidden_units=[1024, 512, 256],
    optimizer=tf.train.ProximalAdagradOptimizer(
      learning_rate=0.1,
      l1_regularization_strength=0.001
    ))

# Input builders
def input_fn_train():  # returns x, Y
  pass
estimator.fit(input_fn=input_fn_train)

def input_fn_eval():  # returns x, Y
  pass
estimator.evaluate(input_fn=input_fn_eval)
estimator.predict(x=x)
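As far as I can tell, an input_fn is just a zero-argument callable that returns the features x and the labels Y. Here is a framework-free sketch of that pattern (make_input_fn is a helper name I've made up), using plain numpy arrays as the return values:

```python
import numpy as np

def make_input_fn(x, y):
    """Wrap fixed arrays in the zero-argument callable the estimator expects."""
    def input_fn():
        return x, y
    return input_fn

X = np.array([[0.0, 1.0], [2.0, 3.0]])
y = np.array([0.5, 1.5])

input_fn_train = make_input_fn(X, y)
x_out, y_out = input_fn_train()
print(x_out.shape, y_out.shape)  # (2, 2) (2,)
```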

And then the big question is how to get these to work together.

Here are a few pages I've been looking at.


Solution

  • I've found lower-level TensorFlow pretty hard to figure out in the past as well, and the documentation hasn't been amazing. If you instead focus on getting the hang of sklearn, you should find it relatively easy to work with skflow. skflow sits at a much higher level than TensorFlow and has almost the same API as sklearn.

    Now to the answer:

    As a regression example, we'll just perform regression on the iris dataset. Now this is a silly idea, but it's just to demonstrate how to use DNNRegressor.

    Skflow API

    The first time you use a new API, try to use as few parameters as possible. You just want to get something working. So, I propose you can set up a DNNRegressor like this:

    estimator = skflow.DNNRegressor(hidden_units=[16, 16])
    

    I kept the number of hidden units small because I don't have much computational power right now.

    Then you give it the training data train_X and the training labels train_y, and you fit it as follows:

    estimator.fit(train_X, train_y)
    

    This is the standard procedure for all sklearn classifiers and regressors, and skflow just extends tensorflow to be similar to sklearn. I also set the parameter steps = 10 so that the training finishes faster, running for only 10 iterations.

    Now, if you want it to predict on some new data, test_X, you do that as follows:

    pred = estimator.predict(test_X)
    

    Again, this is standard procedure for all sklearn code. So that's it - skflow is so simplified you just need those three lines!

    What's the format of train_X and train_y?

    If you aren't too familiar with machine learning, your training data is generally an ndarray (matrix) of size M x d where you have M training examples and d features. Your labels are M x 1 (ndarray of shape (M,)).

    So what you have is something like this:

    Features:   Sepal Width    Sepal Length ...               Labels
              [   5.1            2.5             ]         [0 (setosa)     ]
      X =     [   2.3            2.4             ]     y = [1 (virginica)  ]
              [   ...             ...            ]         [    ....       ]
              [   1.3            4.5             ]         [2 (Versicolour)]
    

    (note I just made all those numbers up).

    The test data will just be an N x d matrix where you have N test examples. The test examples all need to have d features. The predict function will take in the test data and return to you the test labels of shape N x 1 (ndarray of shape (N,))
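Those shapes are easy to check in numpy; a quick illustration with made-up sizes:

```python
import numpy as np

M, d, N = 150, 4, 30        # e.g. 150 training rows, 4 features, 30 test rows
train_X = np.zeros((M, d))  # features: one row per training example
train_y = np.zeros(M)       # labels: ndarray of shape (M,)
test_X = np.zeros((N, d))   # test examples must have the same d features

print(train_X.shape, train_y.shape, test_X.shape)  # (150, 4) (150,) (30, 4)
```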

    You didn't supply your .csv file, so I'll let you parse the data into that format. Conveniently, we can use sklearn.datasets.load_iris() to get the X and y we want. It's just

    iris = datasets.load_iris()
    X = iris.data 
    y = iris.target
    

    Using a Regressor as a Classifier

    The output of your DNNRegressor will be a bunch of real numbers (like 1.6789). But the iris dataset has labels 0, 1, and 2 - the integer IDs for Setosa, Versicolour, and Virginica. To perform a classification with this regressor, we will just round to the nearest label (0, 1, or 2). For example, a prediction of 1.6789 will round to 2.
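That rounding step might look like this in numpy (clipping keeps stray predictions inside the valid label range):

```python
import numpy as np

pred = np.array([1.6789, -0.2, 0.4, 2.7])          # raw regressor outputs
labels = np.clip(np.rint(pred), 0, 2).astype(int)  # round, then clamp to {0, 1, 2}
print(labels)  # [2 0 0 2]
```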

    Working Example

    I find I learn the most with a working example. So here's a very simplified working example:

    [Image: screenshot of the complete working example code]
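The screenshot of the working example hasn't survived here. As a stand-in, this sketch runs the same three-step flow (fit, predict, round) using an ordinary least-squares fit in numpy instead of skflow's DNNRegressor, and synthetic data instead of load_iris(); it only illustrates the shapes and the rounding trick, not the original network:

```python
import numpy as np

# Synthetic stand-in for iris-like data: 120 training rows, 4 features
rng = np.random.RandomState(0)
train_X = rng.rand(120, 4)
train_y = (train_X @ np.array([0.5, 1.0, -0.3, 0.8]) > 1.0).astype(float)
test_X = rng.rand(30, 4)

# "fit": ordinary least squares in place of the neural network
A = np.c_[train_X, np.ones(len(train_X))]  # add a bias column
w, *_ = np.linalg.lstsq(A, train_y, rcond=None)

# "predict": real-valued outputs, like DNNRegressor's
pred = np.c_[test_X, np.ones(len(test_X))] @ w

# "classify": round to the nearest integer label and clamp to {0, 1, 2}
pred_labels = np.clip(np.rint(pred), 0, 2).astype(int)
print(pred_labels.shape)  # (30,)
```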

    Feel free to post any further questions as a comment.