I am using the Question Pairs dataset from Kaggle with the SdcaLogisticRegression trainer. The ML.NET version is 14.0.
Program.cs:
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;
using static Microsoft.ML.DataOperationsCatalog;
using Microsoft.ML.Trainers;
using Microsoft.ML.Transforms.Text;
namespace Csharp_machieneLearning
{
    class Program
    {
        private static IDataView TransData;

        public static void Evaluate(MLContext mlContext, ITransformer model, IDataView splitTestSet)
        {
            Console.WriteLine("=============== Evaluating Model accuracy with Test data===============");
            IDataView predictions = model.Transform(splitTestSet);
            CalibratedBinaryClassificationMetrics metrics = mlContext.BinaryClassification.Evaluate(predictions, "is_duplicate");
            Console.WriteLine();
            Console.WriteLine("Model quality metrics evaluation");
            Console.WriteLine("--------------------------------");
            Console.WriteLine($"Accuracy: {metrics.Accuracy:P2}");
            Console.WriteLine($"Auc: {metrics.AreaUnderRocCurve:P2}");
            Console.WriteLine($"F1Score: {metrics.F1Score:P2}");
            Console.WriteLine("=============== End of model evaluation ===============");
        }

        static void Main(string[] args)
        {
            MLContext mlContext = new MLContext();
            Console.WriteLine($"=============== Loading Dataset ===============");
            IDataView file = mlContext.Data.LoadFromTextFile<QuestionPairs>(@"C:\Users\ludwi\source\repos\Csharp_machieneLearning\questions.csv", separatorChar: ',', hasHeader: true);
            Console.WriteLine($"=============== Finished Loading Dataset ===============");
            IEstimator<ITransformer> pipeline = mlContext.Transforms.Conversion.ConvertType("is_duplicate", outputKind: DataKind.Boolean)
                //.Append(mlContext.Transforms.Conversion.MapValueToKey(inputColumnName: "is_duplicate", outputColumnName: "Label"))
                .Append(mlContext.Transforms.Text.FeaturizeText(inputColumnName: "question1", outputColumnName: "question1Featurized"))
                .Append(mlContext.Transforms.Text.FeaturizeText(inputColumnName: "question2", outputColumnName: "question2Featurized"))
                .Append(mlContext.Transforms.Concatenate("Features", "question1Featurized", "question2Featurized"))
                .Append(mlContext.Transforms.NormalizeMinMax("Features"));
            IEstimator<ITransformer> estimator = mlContext.BinaryClassification.Trainers.SdcaLogisticRegression(labelColumnName: "is_duplicate", featureColumnName: "Features");
            var transData = pipeline.Fit(file).Transform(file);
            var data = mlContext.Data.TrainTestSplit(transData, testFraction: 0.25);
            var model = estimator.Fit(data.TrainSet);
            Evaluate(mlContext, model, data.TestSet);
        }
    }
}
QuestionPairs.cs:
using System;
using System.Collections.Generic;
using System.Dynamic;
using System.Text;
using Microsoft.ML.Data;
namespace Csharp_machieneLearning
{
    public class QuestionPairs
    {
        [LoadColumn(3)]
        public string question1 { get; set; }

        [LoadColumn(4)]
        public string question2 { get; set; }

        [LoadColumn(5)]
        public string is_duplicate { get; set; }
    }

    public class QuestionPrediction : QuestionPairs
    {
        [ColumnName("PredictedLabel")]
        public bool Prediction { get; set; }

        public float Probability { get; set; }

        public float Score { get; set; }
    }
}
I thought the problem could be the ConvertType("is_duplicate", outputKind: DataKind.Boolean) call, so I created a custom transformer:
Action<QuestionPairs, transformOutput> mapping = (input, output) => { output.Label = input.is_duplicate.Equals("1") ? true : false; };
IEstimator<ITransformer> pipeline = mlContext.Transforms.CustomMapping(mapping, contractName: null)
    .Append(mlContext.Transforms.Text.FeaturizeText(inputColumnName: "question1", outputColumnName: "question1Featurized"))
    .Append(mlContext.Transforms.Text.FeaturizeText(inputColumnName: "question2", outputColumnName: "question2Featurized"))
    .Append(mlContext.Transforms.Concatenate("Features", "question1Featurized", "question2Featurized"))
    //.Append(mlContext.Transforms.NormalizeMinMax("Features"))
    //.AppendCacheCheckpoint(mlContext)
    .Append(mlContext.BinaryClassification.Trainers.SdcaLogisticRegression(labelColumnName: nameof(customTransform.Label), featureColumnName: "Features"));
This did not seem to help, so I wondered whether the program was loading the dataset correctly at all. Therefore I added a call to Preview():
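// Preview the first 10 rows of the loaded data (the IDataView named file).
var preview = file.Preview(10);
foreach (var row in preview.RowView)
{
    // Each entry in row.Values is a column name / value pair.
    foreach (var column in row.Values)
    {
        Console.WriteLine(column);
    }
    Console.WriteLine("=============================================================");
}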
The output showed that the "is_duplicate" column sometimes contains a string that is meant to be part of the features. This is caused by a "," inside the question text.
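To illustrate how a comma inside a question can shift the columns, here is a minimal sketch in plain C#, independent of ML.NET. The row is made up, the three leading columns are placeholders, and the class and method names (CsvQuotingDemo, NaiveSplit, QuoteAwareSplit) are hypothetical. It is not how ML.NET parses files internally; it only shows the effect of ignoring quotes.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class CsvQuotingDemo
{
    // Naive split: every comma is a separator, even inside quotes.
    public static string[] NaiveSplit(string line) => line.Split(',');

    // Minimal quote-aware split: commas inside double quotes stay in the field.
    // Treats "" inside a quoted field as an escaped quote. Not a full CSV parser.
    public static List<string> QuoteAwareSplit(string line)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (c == '"')
            {
                if (inQuotes && i + 1 < line.Length && line[i + 1] == '"')
                {
                    current.Append('"'); // escaped quote
                    i++;
                }
                else
                {
                    inQuotes = !inQuotes; // opening or closing quote
                }
            }
            else if (c == ',' && !inQuotes)
            {
                fields.Add(current.ToString());
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }

        fields.Add(current.ToString());
        return fields;
    }

    public static void Main()
    {
        // Made-up row: three placeholder columns, question1, question2, is_duplicate.
        string row = "0,1,2,\"What is a good, cheap laptop?\",\"Which laptop is cheap?\",0";

        // With the naive split, index 5 holds question text instead of the label.
        Console.WriteLine(NaiveSplit(row)[5]);

        // With quote handling, index 5 holds the label.
        Console.WriteLine(QuoteAwareSplit(row)[5]);
    }
}
```

With the naive split, the extra comma produces seven fields instead of six, so everything after question1 moves one position to the right.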
After a quick search, I found the allowQuoting: true parameter of the LoadFromTextFile() function, and with it the preview looks as intended.
After running the full code, the results are as they should be.
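For reference, the loading step with quoting enabled could look like the sketch below. It reuses the QuestionPairs column mapping from above. The file name questions.csv is a placeholder path and the class name LoadWithQuoting is hypothetical. It needs the Microsoft.ML NuGet package.

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.Data;

public class QuestionPairs
{
    [LoadColumn(3)]
    public string question1 { get; set; }

    [LoadColumn(4)]
    public string question2 { get; set; }

    [LoadColumn(5)]
    public string is_duplicate { get; set; }
}

public static class LoadWithQuoting
{
    public static void Main()
    {
        var mlContext = new MLContext();

        // allowQuoting: true keeps commas inside quoted fields as part of the field
        // instead of treating them as column separators.
        IDataView file = mlContext.Data.LoadFromTextFile<QuestionPairs>(
            "questions.csv", // placeholder path, adjust to your dataset location
            separatorChar: ',',
            hasHeader: true,
            allowQuoting: true);

        // Print the first rows to check that is_duplicate only holds label values.
        var preview = file.Preview(10);
        foreach (var row in preview.RowView)
        {
            foreach (var column in row.Values)
            {
                Console.WriteLine($"{column.Key}: {column.Value}");
            }
            Console.WriteLine("=============================================================");
        }
    }
}
```

The rest of the pipeline stays unchanged; only the loading call gets the extra argument.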