TL;DR: 1) read CSV data into arrays, 2) create a regression model from the data. Note that I was very new to Python, deep learning, and Stack Overflow in 2016. Please vote to close this; I think it's too outdated.
Endless Googling has left me better educated on Python and NumPy, but still clueless about solving my task. I want to read a CSV of integer/floating-point values and predict a value using a neural network. I have found several examples that read the Iris dataset and do classification, but I don't understand how to make them work for regression. Can someone help me connect the dots?
Here is one line of the input:
16804,0,1,0,1,1,0,1,0,1,0,1,0,0,1,1,0,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,1,0,0,1,1,0,0,1,0,1,0,1,0,1,0,1,0,1,0,1,1,0,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,1,0,0,1,0,1,0,1,0,1,1,0,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,1,0,0,1,0,1,0,1,0,1,0,1,0,1,1,0,0,1,0,0,0,1,1,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.490265,0.620805,0.54977,0.869299,0.422268,0.351223,0.33572,0.68308,0.40455,0.47779,0.307628,0.301921,0.318646,0.365993,6135.81
That should be 925 values. The last column is the output; the first is the RowID. Most are binary values because I've already done one-hot encoding. The test files do not have the output/last column. The full training file has around 10M rows. A general M x N solution will do.
Edit: Let's use this sample data, since Iris is a classification problem, but note that the above is my real target. I removed the ID column. Let's predict the last column given the other 6 columns. This has 45 rows. (src: http://www.stat.ufl.edu/~winner/data/civwar2.dat)
100,1861,5,2,3,5,38
112,1863,11,7,4,59.82,15.18
113,1862,34,32,1,79.65,2.65
90,1862,5,2,3,68.89,5.56
93,1862,14,10,4,61.29,17.2
179,1862,22,19,3,62.01,8.89
99,1861,22,16,6,67.68,27.27
111,1862,16,11,4,78.38,8.11
107,1863,17,11,5,60.75,5.61
156,1862,32,30,2,60.9,12.82
152,1862,23,21,2,73.55,6.41
72,1863,7,3,3,54.17,20.83
134,1862,22,21,1,67.91,9.7
180,1862,23,16,4,69.44,3.89
143,1863,23,19,4,81.12,8.39
110,1862,16,12,2,31.82,9.09
157,1862,15,10,5,52.23,24.84
101,1863,4,1,3,58.42,18.81
115,1862,14,11,3,86.96,5.22
103,1862,7,6,1,70.87,0
90,1862,11,11,0,70,4.44
105,1862,20,17,3,80,4.76
104,1862,11,9,1,29.81,9.62
102,1862,17,10,7,49.02,6.86
112,1862,19,14,5,26.79,14.29
87,1862,6,3,3,8.05,72.41
92,1862,4,3,0,11.96,86.96
108,1862,12,7,3,16.67,25
86,1864,0,0,0,2.33,11.63
82,1864,4,3,1,81.71,8.54
76,1864,1,0,1,48.68,6.58
79,1864,0,0,0,15.19,21.52
85,1864,1,1,0,89.41,3.53
85,1864,1,1,0,56.47,0
85,1864,0,0,0,31.76,15.29
87,1864,6,5,0,81.61,3.45
85,1864,5,5,0,72.94,0
83,1864,0,0,0,46.99,2.38
101,1864,5,5,0,1.98,95.05
99,1864,6,6,0,42.42,9.09
10,1864,0,0,0,50,9
98,1864,6,6,0,79.59,3.06
10,1864,0,0,0,71,9
78,1864,5,5,0,70.51,1.28
89,1864,4,4,0,59.55,13.48
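Parsing rows like these into a feature matrix and a label vector takes only a couple of NumPy calls. The snippet below uses a few of the rows above inline via `StringIO`; reading from a file works the same way (just pass the filename to `np.loadtxt`):

```python
import numpy as np
from io import StringIO

# A few of the rows above, stored as CSV text (a file works identically)
csv_text = """100,1861,5,2,3,5,38
112,1863,11,7,4,59.82,15.18
113,1862,34,32,1,79.65,2.65"""

data = np.loadtxt(StringIO(csv_text), delimiter=",")
X = data[:, :-1]   # first 6 columns are the features
y = data[:, -1]    # last column is the value to predict

print(X.shape, y.shape)  # (3, 6) (3,)
```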
Let me add that this is a common task, but it doesn't seem to be answered in any forum I've read, which is why I'm asking. I could give you my broken code, but I don't want to waste your time with code that isn't functionally correct. Sorry I've asked it this way; I just don't understand the APIs, and the documentation doesn't tell me the data types.
Here is the latest code I have that reads the CSV into two ndarrays:
#!/usr/bin/env python
import numpy as np
# Note: train_test_split now lives in sklearn.model_selection
# (sklearn.cross_validation is deprecated)
from sklearn.model_selection import train_test_split

def buildDataFromCsv():
    # Read the whole CSV into a 2-D float array
    data = np.loadtxt(open("t100.csv.out", "rb"), delimiter=",", skiprows=0)
    # Column 924 (the last one) holds the output value
    labels = data[:, 924]
    print("labels: ", type(labels), labels.shape, labels.ndim)
    # Drop the label column so `data` holds only the features
    data = np.delete(data, [924], axis=1)
    print("data: ", type(data), data.shape, data.ndim)
    return data, labels
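The natural next step after separating features from labels is a train/validation split. Here is a minimal sketch with toy stand-in data (the 80/20 ratio is an arbitrary choice):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the parsed CSV: 10 rows, 4 feature columns + 1 label column
data = np.arange(50, dtype=float).reshape(10, 5)
labels = data[:, -1]
features = data[:, :-1]

# Hold out 20% of the rows for evaluation
train_X, test_X, train_y, test_y = train_test_split(
    features, labels, test_size=0.2, random_state=0)

print(train_X.shape, test_X.shape)  # (8, 4) (2, 4)
```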
And here is the base code that I want to use. The example this came from wasn't complete either, and the APIs in the links below are vague. If I could at least figure out the data types that DNNRegressor and the others in the docs expect, I might be able to write some custom code.
estimator = DNNRegressor(
    feature_columns=[education_emb, occupation_emb],
    hidden_units=[1024, 512, 256])

# Or an estimator using the ProximalAdagradOptimizer optimizer with
# regularization.
estimator = DNNRegressor(
    feature_columns=[education_emb, occupation_emb],
    hidden_units=[1024, 512, 256],
    optimizer=tf.train.ProximalAdagradOptimizer(
        learning_rate=0.1,
        l1_regularization_strength=0.001))

# Input builders
def input_fn_train():  # returns x, y
    pass

estimator.fit(input_fn=input_fn_train)

def input_fn_eval():  # returns x, y
    pass

estimator.evaluate(input_fn=input_fn_eval)
estimator.predict(x=x)
And then the big question is how to get these to work together.
Here are a few pages I've been looking at.
I've found lower-level TensorFlow pretty hard to figure out in the past as well, and the documentation hasn't been amazing. If you instead focus on getting the hang of sklearn, you should find it relatively easy to work with skflow. skflow is at a much higher level than tensorflow, and it has almost the same API as sklearn.
Now to the answer:
As a regression example, we'll just perform regression on the iris dataset. Now this is a silly idea, but it's just to demonstrate how to use DNNRegressor.
The first time you use a new API, try to use as few parameters as possible; you just want to get something working. So I propose you set up a DNNRegressor like this:
estimator = skflow.DNNRegressor(hidden_units=[16, 16])
I kept the number of hidden units small because I don't have much computational power right now.
Then you give it the training data, train_X, and the training labels, train_y, and you fit it as follows:
estimator.fit(train_X, train_y)
This is the standard procedure for all sklearn classifiers and regressors, and skflow just extends tensorflow to be similar to sklearn. I also set the parameter steps=10 so that training finishes faster, since it only runs for 10 iterations.
Now, if you want it to predict on some new data, test_X, you do that as follows:
pred = estimator.predict(test_X)
Again, this is standard procedure for all sklearn code. So that's it: skflow is so simplified that you just need those three lines!
If you aren't too familiar with machine learning, your training data is generally an ndarray (matrix) of size M x d, where you have M training examples and d features. Your labels are M x 1 (an ndarray of shape (M,)).
So what you have is something like this:
Features: Sepal Width Sepal Length ... Labels
[ 5.1 2.5 ] [0 (setosa) ]
X = [ 2.3 2.4 ] y = [1 (virginica) ]
[ ... ... ] [ .... ]
[ 1.3 4.5 ] [2 (Versicolour)]
(note I just made all those numbers up).
The test data will just be an N x d matrix, where you have N test examples. The test examples all need to have d features. The predict function will take in the test data and return to you the test labels, of shape N x 1 (an ndarray of shape (N,)).
You didn't supply your .csv file, so I'll let you parse the data into that format. Conveniently, though, we can use sklearn.datasets.load_iris() to get the X and y we want. It's just:
iris = datasets.load_iris()
X = iris.data
y = iris.target
The output of your DNNRegressor will be a bunch of real numbers (like 1.6789), but the iris dataset has labels 0, 1, and 2: the integer IDs for Setosa, Versicolour, and Virginica. To perform a classification with this regressor, we will just round to the nearest label (0, 1, 2). For example, a prediction of 1.6789 rounds to 2.
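That rounding step is a one-liner in NumPy; the prediction values below are made up for illustration:

```python
import numpy as np

# Hypothetical raw regressor outputs for four test examples
pred = np.array([1.6789, 0.2, 2.7, -0.3])

# Round to the nearest class ID and clip into the valid label range [0, 2]
labels = np.clip(np.rint(pred), 0, 2).astype(int)
print(labels)  # [2 0 2 0]
```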
I find I learn the most with a working example. So here's a very simplified working example:
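skflow has long since been folded into TensorFlow and removed, so as a stand-in here is the same three-line fit/round/predict pattern using sklearn's MLPRegressor, which is a neural-network regressor with the same sklearn-style API. The hidden layer sizes, iteration count, and random seeds are arbitrary choices, not values from the original answer:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

# Load iris and treat the class IDs (0, 1, 2) as regression targets
iris = load_iris()
train_X, test_X, train_y, test_y = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

# Small network, in the same spirit as skflow.DNNRegressor(hidden_units=[16, 16])
estimator = MLPRegressor(hidden_layer_sizes=(16, 16), max_iter=2000,
                         random_state=0)
estimator.fit(train_X, train_y)

# Predict real numbers, then round to the nearest class ID
pred = np.clip(np.rint(estimator.predict(test_X)), 0, 2).astype(int)
print("accuracy:", np.mean(pred == test_y))
```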
Feel free to post any further questions as a comment.