Loading Boston housing dataset using TensorFlow


I am trying to understand the code example Deep Neural Network Regression with Boston Data.

The dataset is described here. It has 14 attributes.

The example uses the following code to load the data.

# Load dataset (tf.contrib.learn is available in TensorFlow 1.x only)
from tensorflow.contrib import learn

boston = learn.datasets.load_dataset('boston')
x, y = boston.data, boston.target

When I inspect x and y, I get the following.

>>> type(x)
<type 'numpy.ndarray'>
>>> type(y)
<type 'numpy.ndarray'>
>>> x.shape
(506, 13)
>>> y.shape
(506,)
>>> 

My questions:

  1. Why has the dataset been divided into two objects, one with 13 attributes and the other with 1?
  2. On what basis has this division been made?

Solution

  • The 13 columns in boston.data are your features. The 1 column in boston.target is your target. The split is done because most machine learning algorithms expect features and targets as separate data structures. The load_dataset function is just making it easier on you by splitting off the MEDV column (the median home value), because most of the time that's the value people want to predict. To put it another way, the designers of load_dataset are assuming you want to predict median home prices from the other 13 features.

    You don't have to do this. You could choose any of the features as your target. Say you wanted to predict RM, the average number of rooms per dwelling. Just merge the MEDV column back into boston.data and split out RM. Then use RM as your target.

    BTW, the link you provided was broken, so I googled it and came up with this Boston Housing price tutorial. It looks pretty complete if you want to do regression in TensorFlow.
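
As a rough sketch of the re-targeting described above (not taken from the linked example), the snippet below swaps the target with plain NumPy. It uses random stand-in arrays with the same shapes as boston.data and boston.target, so it runs without TensorFlow. The helper name retarget is made up for illustration, and RM_INDEX = 5 assumes the usual column order of the Boston data, where RM is the sixth feature.

```python
import numpy as np

# Stand-in for boston.data / boston.target: random numbers with the same
# shapes as in the question (506 rows, 13 features, 1 target value per row).
rng = np.random.default_rng(0)
x = rng.random((506, 13))
y = rng.random(506)

# Assumption: standard Boston column order, where RM is the 6th feature.
RM_INDEX = 5


def retarget(features, target, column):
    """Use features[:, column] as the new target and append the old
    target to the remaining features (hypothetical helper)."""
    new_target = features[:, column]
    remaining = np.delete(features, column, axis=1)
    new_features = np.column_stack([remaining, target])
    return new_features, new_target


new_x, new_y = retarget(x, y, RM_INDEX)
print(new_x.shape, new_y.shape)  # (506, 13) (506,)
```

The shapes stay the same as before: there are still 13 feature columns and one target, only now the old target (MEDV) sits in the last feature column and RM is what the model would be trained to predict.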