c#, machine-learning, .net-core, ml.net

ML.NET accuracy is always 100%


I am using the Question Pairs dataset from Kaggle with the SdcaLogisticRegression trainer. The ML.NET version is 14.0.

My PC specs:

  1. Operating system: Microsoft Windows 10 Pro
  2. System type: x64-based PC
  3. RAM: 32.0 GB
  4. CPU: Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz, 2208 MHz, 6 core(s), 12 logical processor(s)

Program.cs:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;
using static Microsoft.ML.DataOperationsCatalog;
using Microsoft.ML.Trainers;
using Microsoft.ML.Transforms.Text;

namespace Csharp_machieneLearning
{
    class Program
    {
        private static IDataView TransData;

        public static void Evaluate(MLContext mlContext, ITransformer model, IDataView splitTestSet)
        {
            Console.WriteLine("=============== Evaluating Model accuracy with Test data===============");
            IDataView predictions = model.Transform(splitTestSet);
            CalibratedBinaryClassificationMetrics metrics = mlContext.BinaryClassification.Evaluate(predictions, "is_duplicate");
            Console.WriteLine();
            Console.WriteLine("Model quality metrics evaluation");
            Console.WriteLine("--------------------------------");
            Console.WriteLine($"Accuracy: {metrics.Accuracy:P2}");
            Console.WriteLine($"Auc: {metrics.AreaUnderRocCurve:P2}");
            Console.WriteLine($"F1Score: {metrics.F1Score:P2}");
            Console.WriteLine("=============== End of model evaluation ===============");
        }


        static void Main(string[] args)
        {
            MLContext mlContext = new MLContext();
            Console.WriteLine($"=============== Loading Dataset  ===============");
            IDataView file = mlContext.Data.LoadFromTextFile<QuestionPairs>(@"C:\Users\ludwi\source\repos\Csharp_machieneLearning\questions.csv", separatorChar: ',', hasHeader: true);
            Console.WriteLine($"=============== Finished Loading Dataset  ===============");
            IEstimator<ITransformer> pipeline = mlContext.Transforms.Conversion.ConvertType("is_duplicate", outputKind: DataKind.Boolean)
                            //.Append(mlContext.Transforms.Conversion.MapValueToKey(inputColumnName: "is_duplicate", outputColumnName: "Label"))
                            .Append(mlContext.Transforms.Text.FeaturizeText(inputColumnName: "question1", outputColumnName: "question1Featurized"))
                            .Append(mlContext.Transforms.Text.FeaturizeText(inputColumnName: "question2", outputColumnName: "question2Featurized"))
                            .Append(mlContext.Transforms.Concatenate("Features", "question1Featurized", "question2Featurized"))
                            .Append(mlContext.Transforms.NormalizeMinMax("Features"));

            IEstimator<ITransformer> estimator = mlContext.BinaryClassification.Trainers.SdcaLogisticRegression(labelColumnName: "is_duplicate", featureColumnName: "Features");

            var transData = pipeline.Fit(file).Transform(file);
            var data = mlContext.Data.TrainTestSplit(transData, testFraction: 0.25);
            var model = estimator.Fit(data.TrainSet);

            Evaluate(mlContext, model, data.TestSet);
        }
    }
}

QuestionPairs.cs:

using System;
using System.Collections.Generic;
using System.Dynamic;
using System.Text;
using Microsoft.ML.Data;

namespace Csharp_machieneLearning
{
    public class QuestionPairs
    {
        [LoadColumn(3)]
        public string question1 { get; set; }

        [LoadColumn(4)]
        public string question2 { get; set; }

        [LoadColumn(5)]
        public string is_duplicate { get; set; }

    }
    public class QuestionPrediction : QuestionPairs
    {
        [ColumnName("PredictedLabel")]
        public bool Prediction { get; set; }

        public float Probability { get; set; }

        public float Score { get; set; }
    }
}

Output: (screenshot of the evaluation output)


Solution

  • I thought the problem might be the ConvertType("is_duplicate", outputKind: DataKind.Boolean) call, so I replaced it with a custom mapping:

    // transformOutput is a class (not shown here) with a bool Label property.
    Action<QuestionPairs, transformOutput> mapping = (input, output) => { output.Label = input.is_duplicate.Equals("1"); };
    IEstimator<ITransformer> pipeline = mlContext.Transforms.CustomMapping(mapping, contractName: null)
        .Append(mlContext.Transforms.Text.FeaturizeText(inputColumnName: "question1", outputColumnName: "question1Featurized"))
        .Append(mlContext.Transforms.Text.FeaturizeText(inputColumnName: "question2", outputColumnName: "question2Featurized"))
        .Append(mlContext.Transforms.Concatenate("Features", "question1Featurized", "question2Featurized"))
        //.Append(mlContext.Transforms.NormalizeMinMax("Features"))
        //.AppendCacheCheckpoint(mlContext)
        .Append(mlContext.BinaryClassification.Trainers.SdcaLogisticRegression(labelColumnName: nameof(transformOutput.Label), featureColumnName: "Features"));
    

    This did not help, so I wondered whether the program was loading the dataset correctly at all.

    Therefore I added a call to Preview() to inspect the first rows.

    // file is the IDataView returned by LoadFromTextFile.
    var preview = file.Preview(10);

    foreach (var row in preview.RowView)
    {
        // Each entry is a (column name, value) pair.
        foreach (var column in row.Values)
        {
            Console.WriteLine(column);
        }
        Console.WriteLine("=============================================================");
    }
    

    Output: (screenshot of the preview)

    As you can see, the "is_duplicate" column sometimes contains a string that is meant to be part of the features. This is caused by a comma inside the question text.
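
    To illustrate why a comma inside a question shifts the columns, here is a minimal sketch that uses only the standard library. The sample row is made up; it only assumes the column order implied by the LoadColumn indices above (question1 at 3, question2 at 4, is_duplicate at 5). It mimics a quote-unaware split and is not the loader's actual implementation.

    ```csharp
    using System;

    static class QuotedCommaDemo
    {
        // Splits on every comma and ignores quotes, roughly what a
        // quote-unaware reader does.
        public static string[] NaiveSplit(string line) => line.Split(',');

        static void Main()
        {
            // Hypothetical row: id,qid1,qid2,question1,question2,is_duplicate
            string line = "7,15,16,\"What is A, exactly?\",\"What is A?\",0";

            string[] fields = NaiveSplit(line);

            Console.WriteLine(fields.Length); // 7 fields instead of the expected 6
            Console.WriteLine(fields[5]);     // "What is A?" (question2 text where the label should be)
        }
    }
    ```

    Because question1 is split in two, everything after it moves one column to the right, so column 5 no longer holds the label.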

    After a quick search, I found the allowQuoting: true parameter of the LoadFromTextFile() function. With it, the preview shows the data as intended (screenshot of the corrected preview), and after running the full code the results are as they should be.
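
    For reference, the changed loading call might look like the sketch below. The helper class and method names are hypothetical, and the path is left to the caller. QuestionPairs is the class from QuestionPairs.cs in the question.

    ```csharp
    using Microsoft.ML;

    namespace Csharp_machieneLearning
    {
        static class DataLoading
        {
            // Hypothetical helper: loads the CSV with quoting enabled.
            public static IDataView LoadQuestions(MLContext mlContext, string path)
            {
                return mlContext.Data.LoadFromTextFile<QuestionPairs>(
                    path,
                    separatorChar: ',',
                    hasHeader: true,
                    allowQuoting: true); // quoted fields stay intact, even with embedded commas
            }
        }
    }
    ```

    The rest of the pipeline stays as it is; only the loader call in Main needs the extra argument.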
