c# .net cluster-analysis ml.net

How to specify the number of features (vector size) at run time for KMeans clustering with ML.NET


I want to use the ML.NET KMeans algorithm, but at compile time I do not know the width of the dataset, i.e. the number of features.

The VectorType length has to be a constant, so passing a variable as the attribute argument does not work.

class InputData
{
    public string AssetID { get; set; }

    [VectorType(5)] // I do not know whether the data will contain 5 features or more
    public float[] Features { get; set; }
}

It is used like this:

InputData row = new InputData { AssetID = Data[0, i + 1].ToString(), Features = features };

var context = new MLContext();
var dataView = context.Data.LoadFromEnumerable(dataArray);
string featuresColumnName = "Features";
var pipeline = context.Transforms.Concatenate(featuresColumnName, "Features")
    .Append(context.Clustering.Trainers.KMeans(featuresColumnName, numberOfClusters: NumberClusters));

var model = pipeline.Fit(dataView);

Solution

  • If every row has the same vector dimension and that dimension is only known at run time, you can set it through the schema at run time:

        private class SampleTemperatureDataVector
        {
            public DateTime Date { get; set; }
            public float[] Temperature { get; set; }
        }
    

    Notice that this type has no annotations. You can create a SchemaDefinition from it and then modify that schema. In the initial SchemaDefinition, the vector column's type has IsKnownSize set to false. After the modification, its Size equals the dimension you set, 3 in this case.

        var mlContext = new MLContext();

        var data2 = new SampleTemperatureDataVector[]
        {
            new SampleTemperatureDataVector
            {
                Date = DateTime.UtcNow,
                Temperature = new float[] { 1.2f, 3.4f, 5.6f }
            },
            new SampleTemperatureDataVector
            {
                Date = DateTime.UtcNow,
                Temperature = new float[] { 1.2f, 3.4f, 5.6f }
            },
        };

        // Known only at run time; every row must hold exactly this many values.
        int featureDimension = 3;

        // Infer the schema from the class, then fix the size of the vector column.
        var autoSchema = SchemaDefinition.Create(typeof(SampleTemperatureDataVector));
        var featureColumn = autoSchema[1]; // index 1 is the Temperature column
        var itemType = ((VectorDataViewType)featureColumn.ColumnType).ItemType;
        featureColumn.ColumnType = new VectorDataViewType(itemType, featureDimension);

        // Pass the modified schema so the loaded column has a known size.
        IDataView data3 = mlContext.Data.LoadFromEnumerable(data2, autoSchema);
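
    To connect this back to the question, here is a minimal sketch of the same schema trick applied to the question's InputData type and fed into KMeans. The RuntimeSizedKMeans class and its Load and Train helpers are hypothetical names introduced for illustration; the sketch assumes ML.NET 1.x or later, that the feature count can be read from the first row, and that every row has the same length. It looks the column up by name instead of by index.

```csharp
using System;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

public class InputData
{
    public string AssetID { get; set; }

    // No [VectorType] attribute: the size is supplied at run time instead.
    public float[] Features { get; set; }
}

// Hypothetical helper class, not part of ML.NET.
public static class RuntimeSizedKMeans
{
    // Loads the rows with a vector size taken from the data itself.
    public static IDataView Load(MLContext context, InputData[] rows)
    {
        if (rows.Length == 0)
            throw new ArgumentException("At least one row is required.");

        // Assumption: the first row defines the dimension for the whole dataset.
        int featureDimension = rows[0].Features.Length;
        if (rows.Any(r => r.Features.Length != featureDimension))
            throw new ArgumentException("All rows must have the same number of features.");

        // Start from the inferred schema and give the vector column a known size.
        var schema = SchemaDefinition.Create(typeof(InputData));
        schema["Features"].ColumnType =
            new VectorDataViewType(NumberDataViewType.Single, featureDimension);

        return context.Data.LoadFromEnumerable(rows, schema);
    }

    // Fits KMeans on the "Features" column of an already loaded view.
    public static ITransformer Train(MLContext context, IDataView dataView, int numberOfClusters)
    {
        var trainer = context.Clustering.Trainers.KMeans(
            featureColumnName: "Features", numberOfClusters: numberOfClusters);
        return trainer.Fit(dataView);
    }
}
```

    With these helpers, RuntimeSizedKMeans.Load(context, dataArray) followed by RuntimeSizedKMeans.Train(context, dataView, NumberClusters) would stand in for the LoadFromEnumerable and Fit calls in the question. The length check is there because a row whose array does not match the declared size is expected to fail later, when the data is actually read, and failing early makes the cause easier to see.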