Tags: python, tensorflow, machine-learning, keras, prediction

How to make a prediction using a model trained on a CSV dataset?


Following a tutorial, I built a neural network whose dataset comes from a CSV file I made myself. It is a simple dataset that contains the first, second and third exam results and the nationality of each student. The goal is to predict the third exam result from the first and second exam results and the nationality. Here is what the code looks like.

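import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers.experimental import preprocessing

column_names = ['First exam result', 'Second exam result', 'Third exam result', 'Country']
dataset = pd.read_csv('data1.csv', names=column_names, sep=';')
dataset = dataset.dropna()  # drop rows with missing values

# encode the categorical 'Country' column as integer codes (PL -> 0, ENG -> 1); this is not one-hot
dataset.Country = pd.Categorical(dataset.Country, ['PL', 'ENG'], ordered=True)
dataset.Country = dataset.Country.cat.codes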

# split data
train_dataset = dataset.sample(frac=0.8, random_state=0)
test_dataset = dataset.drop(train_dataset.index)

train_features = train_dataset.copy()
test_features = test_dataset.copy()

train_labels = train_features.pop('Third exam result')
test_labels = test_features.pop('Third exam result')

# Normalize
normalizer = preprocessing.Normalization()
normalizer.adapt(np.array(train_features))

loss = keras.losses.MeanAbsoluteError()

linear_model = tf.keras.Sequential([
    normalizer,
    layers.Dense(64, activation='relu'),
    layers.Dense(64, activation='relu'),
    layers.Dense(units=1)])

linear_model.compile(optimizer=tf.optimizers.Adam(learning_rate=0.1), loss=loss)

linear_model.fit(
    train_features, train_labels,
    epochs=500,
    verbose=1,
    # Calculate validation results on 20% of the training data
    validation_split=0.2)

linear_model.evaluate(
    test_features, test_labels, verbose=1)


Now I want to make a prediction using the testdata.csv file, which contains all the information except the third exam result, but I don't know how to do that.

prediction_data = pd.read_csv('testdata.csv', names=column_names, sep=';')

Solution

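  • Apply the same preprocessing to the prediction data that you applied to the training data, but do not call normalizer.adapt() again. The normalizer is the first layer of linear_model, so predict() already scales the input with the mean and variance learned from train_features. Re-adapting on the prediction data would replace those training statistics, and the model would see inputs on a different scale than the one it was trained on. If testdata.csv really has no third exam column, read it with the three feature names only, in the same order as train_features.

    feature_names = ['First exam result', 'Second exam result', 'Country']
    prediction_data = pd.read_csv('testdata.csv', names=feature_names, sep=';')
    prediction_data = prediction_data.dropna()
    
    # same category order as in training, so PL -> 0 and ENG -> 1 again
    prediction_data.Country = pd.Categorical(prediction_data.Country, ['PL', 'ENG'], ordered=True)
    prediction_data.Country = prediction_data.Country.cat.codes
    
    # the normalizer layer inside the model scales the input; no extra step is needed
    predict = linear_model.predict(prediction_data)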
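
To see why the scaling statistics should come from the training data only, here is a minimal plain-Python sketch of mean/variance normalization. It is not Keras code: fit_stats and normalize are hypothetical helpers that only mimic what a normalization layer does conceptually, and the scores are made-up numbers, not values from data1.csv or testdata.csv.

```python
from statistics import fmean, pstdev


def fit_stats(values):
    """Return the (mean, population std) of the values, i.e. what a normalizer learns."""
    return fmean(values), pstdev(values)


def normalize(values, mean, std):
    """Scale values with previously computed statistics."""
    return [(v - mean) / std for v in values]


# Made-up first-exam scores, purely for illustration.
train_scores = [40.0, 50.0, 60.0]
new_scores = [90.0, 100.0, 110.0]  # clearly higher than anything seen in training

train_mean, train_std = fit_stats(train_scores)
train_normalized = normalize(train_scores, train_mean, train_std)

# Reusing the training statistics keeps the new scores on the training scale,
# so they still look "high" to the model.
reused = normalize(new_scores, train_mean, train_std)

# Refitting on the new data recentres it around zero, which hides the difference.
refit_mean, refit_std = fit_stats(new_scores)
refit = normalize(new_scores, refit_mean, refit_std)

print("training data, training stats:", train_normalized)
print("new data, training stats:     ", reused)
print("new data, refitted stats:     ", refit)
```

With refitted statistics the new scores end up with the same normalized values as the training scores, so a model could not tell that these students scored much higher. With the training statistics reused, the difference survives the scaling step.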