Tags: python, pandas, machine-learning, neural-network, tflearn

Preprocessing csv files to use with tflearn


My question is about preprocessing CSV files before feeding them into a neural network.

I want to build a deep neural network for the famous iris dataset using tflearn in python 3.

Dataset: http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data

I'm using tflearn to load the CSV file. However, the class column of my dataset contains words such as iris-setosa, iris-versicolor, and iris-virginica.

Neural networks work only with numbers, so I have to find a way to change the classes from words to numbers. Since it is a very small dataset, I can do it manually in Excel or a text editor, so I manually assigned a number to each class.

But I can't possibly do that for every dataset I work with, so I tried using pandas to perform one-hot encoding.

preprocess_data = pd.read_csv(r"F:\Gautam\.....\Dataset\iris_data.csv")  # raw string so backslashes are not treated as escapes
preprocess_data = pd.get_dummies(preprocess_data)

But now, I can't use this piece of code:

data, labels = load_csv('filepath', categorical_labels=True,
                     n_classes=3)

'filepath' must be the path to a CSV file on disk; it can't be a variable like preprocess_data that holds a DataFrame.

Original Dataset:

     Sepal Length  Sepal Width  Petal Length  Petal Width  Class
89            5.5          2.5           4.0          1.3  iris-versicolor
85            6.0          3.4           4.5          1.6  iris-versicolor
31            5.4          3.4           1.5          0.4  iris-setosa
52            6.9          3.1           4.9          1.5  iris-versicolor
111           6.4          2.7           5.3          1.9  iris-virginica

Manually modified dataset:

     Sepal Length  Sepal Width  Petal Length  Petal Width  Class
89            5.5          2.5           4.0          1.3      1
85            6.0          3.4           4.5          1.6      1
31            5.4          3.4           1.5          0.4      0
52            6.9          3.1           4.9          1.5      1
111           6.4          2.7           5.3          1.9      2

Here's my code, which runs perfectly, but only because I modified the dataset manually.

import numpy as np
import pandas as pd
import tflearn
from tflearn.layers.core import input_data, fully_connected
from tflearn.layers.estimator import regression
from tflearn.data_utils import load_csv


data_source = r'F:\Gautam\.....\Dataset\iris_data.csv'  # raw string so backslashes are not treated as escapes

data, labels = load_csv(data_source, categorical_labels=True,
                         n_classes=3)


network = input_data(shape=[None, 4], name='InputLayer')

network = fully_connected(network, 9, activation='sigmoid', name='Hidden_Layer_1')

network = fully_connected(network, 3, activation='softmax', name='Output_Layer')

network = regression(network, batch_size=1, optimizer='sgd', learning_rate=0.2)

model = tflearn.DNN(network)
model.fit(data, labels, show_metric=True, run_id='iris_dataset', validation_set=0.1, n_epoch=2000)

I want to know if there's a built-in function in tflearn (or in any other module, for that matter) that I can use to convert my class values from words to numbers. I don't think manually modifying every dataset would be productive.

I'm also a beginner in tflearn and neural networks. Any help would be appreciated. Thanks.


Solution

  • Use LabelEncoder from the sklearn library:

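    import pandas as pd
    from sklearn.preprocessing import LabelEncoder, OneHotEncoder

    # iris_data.csv has no header row, so name the columns explicitly
    df = pd.read_csv('iris_data.csv', header=None)
    df.columns = ['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width', 'Class']

    # Replace each class name with an integer label
    enc = LabelEncoder()
    df['Class'] = enc.fit_transform(df['Class'])
    print(df.head(5))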
    

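    If you want one-hot encoding, label-encode the column first and then apply OneHotEncoder (scikit-learn 0.20 and later can also one-hot encode string columns directly):

    enc = LabelEncoder()
    enc_1 = OneHotEncoder()
    df['Class'] = enc.fit_transform(df['Class'])
    # OneHotEncoder expects 2-D input of shape (n_samples, 1), hence df[['Class']].
    # The result has one column per class, so keep it in its own array
    # rather than assigning it back to the single 'Class' column.
    one_hot = enc_1.fit_transform(df[['Class']]).toarray()
    print(one_hot[:5])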
    

    LabelEncoder sorts the class names alphabetically and then numbers them starting from 0. If you want to see which label is assigned to which class, do:

    for k in enc.classes_:
        print('name ::{}, label ::{}'.format(k, enc.transform([k])))
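
The sort-then-number rule can be sketched in plain Python. This is only an illustration of the mapping, not scikit-learn's implementation. The sample list and the names `classes`, `label_of`, and `encoded` are made up for the example and use the five rows shown in the question.

```python
# Class column of the five sample rows from the question.
classes = ['iris-versicolor', 'iris-versicolor', 'iris-setosa',
           'iris-versicolor', 'iris-virginica']

# Unique class names in sorted order; a name's position is its label.
unique_classes = sorted(set(classes))
label_of = {name: index for index, name in enumerate(unique_classes)}

# Replace every class name with its integer label.
encoded = [label_of[name] for name in classes]

print(label_of)  # {'iris-setosa': 0, 'iris-versicolor': 1, 'iris-virginica': 2}
print(encoded)   # [1, 1, 0, 1, 2]
```

The result matches the numbers assigned by hand in the question: setosa 0, versicolor 1, virginica 2.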
    

    If you want to save this dataframe as a csv file, do:

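    # index=False keeps the DataFrame's row index from being written as an extra first column
    df.to_csv('Processed_Irisdataset.csv', sep=',', index=False)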
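
If the CSV is written only so that load_csv can read it back, the round trip through disk may not be needed. The following is a minimal sketch, assuming model.fit accepts NumPy arrays in place of what load_csv returns. It builds the feature and label arrays straight from the DataFrame and one-hot encodes only the class column with pd.get_dummies. The inline five-row sample stands in for the real file, and the names `sample`, `columns`, `data`, and `labels` are illustrative.

```python
import io

import pandas as pd

# Five rows in the same layout as the dataset file (no header row).
sample = io.StringIO(
    "5.5,2.5,4.0,1.3,iris-versicolor\n"
    "6.0,3.4,4.5,1.6,iris-versicolor\n"
    "5.4,3.4,1.5,0.4,iris-setosa\n"
    "6.9,3.1,4.9,1.5,iris-versicolor\n"
    "6.4,2.7,5.3,1.9,iris-virginica\n"
)
columns = ['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width', 'Class']
df = pd.read_csv(sample, header=None, names=columns)

# Features: the four numeric columns as an array of shape (n_samples, 4).
data = df[columns[:-1]].to_numpy(dtype='float32')

# Labels: one-hot encode only the Class column, giving shape (n_samples, 3).
# get_dummies orders the new columns by sorted class name.
labels = pd.get_dummies(df['Class']).to_numpy(dtype='float32')

print(data.shape, labels.shape)

# Assumed usage with the network from the question:
# model.fit(data, labels, show_metric=True, validation_set=0.1, n_epoch=2000)
```

Calling pd.get_dummies on the whole DataFrame, as in the question, encodes the class column too, but it leaves features and labels mixed in one table; splitting them as above gives the two arrays the network expects.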