
What does ML.NET Concatenate really do?


I believe I understand when Concatenate needs to be called, on what data, and why. What I'm trying to understand is what physically happens to the data in the input columns when Concatenate is called.

Is this some kind of a hash function that hashes all the input data from the columns and generates a result?

In other words, I would like to know whether it is technically possible to restore the original values from the value generated by Concatenate.

Does the order of the data columns passed into Concatenate affect the resulting model, and if so, in what way?

Why am I asking all this? I'm trying to understand which input parameters affect the quality of the produced model, and how. I have many input columns of data. They are all rather important, and so are the relations between their values. If Concatenate does something simple and loses the relations between values, I would try one approach to improve the quality of the model. If it is rather complex and keeps the details of the values, I would use other approaches.


Solution

  • In ML.NET, Concatenate takes individual features (of the same type) and creates a feature vector.

    In pattern recognition and machine learning, a feature vector is an n-dimensional vector of numerical features that represent some object. Many algorithms in machine learning require a numerical representation of objects, since such representations facilitate processing and statistical analysis. When representing images, the feature values might correspond to the pixels of an image, while when representing texts the features might be the frequencies of occurrence of textual terms. Feature vectors are equivalent to the vectors of explanatory variables used in statistical procedures such as linear regression.

    To my understanding there's no hashing involved. Conceptually you can think of it like the String.Join method, where you take individual elements and join them into one. In this case, the single result is a feature vector that, as a whole, represents the underlying data as an array of type T, where T is the data type of the individual columns.

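    As a result, you can always access the individual components by their position in the vector. The order in which the columns are passed determines the order of the slots in the vector; beyond that, order should not matter.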

    Here's an example using F# that takes data, creates a feature vector using the concatenate transform, and accesses the individual components:

    #r "nuget:Microsoft.ML"
    
    open Microsoft.ML
    open Microsoft.ML.Data
    
    // Raw data
    let housingData = 
        seq {
            {| NumRooms = 3f; NumBaths = 2f ; SqFt = 1200f|}
            {| NumRooms = 2f; NumBaths = 1f ; SqFt = 800f|}
            {| NumRooms = 6f; NumBaths = 7f ; SqFt = 5000f|}
        }
    
    // Initialize MLContext
    let ctx = new MLContext()
    
    // Load data into IDataView
    let dataView = ctx.Data.LoadFromEnumerable(housingData)
    
    // Get individual column names (NumRooms, NumBaths, SqFt)
    let columnNames = 
        dataView.Schema 
        |> Seq.map(fun col -> col.Name)
        |> Array.ofSeq
    
    // Create pipeline with concatenate transform
    let pipeline = ctx.Transforms.Concatenate("Features", columnNames)
    
    // Fit data to pipeline and apply transform
    let transformedData = pipeline.Fit(dataView).Transform(dataView)
    
    // Get "Features" column containing the result of applying Concatenate transform
    let features = transformedData.GetColumn<float32 array>("Features")
    
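    // Deconstruct feature vector and print out individual features.
    // Anonymous record fields are sorted by name, so the schema (and therefore
    // the feature vector) comes out as NumBaths, NumRooms, SqFt.
    printfn "Baths | Rooms | Sqft"
    for [|baths;rooms;sqft|] in features do
        printfn $"{baths} | {rooms} | {sqft}"
    

    The result output to the console is:

    Baths | Rooms | Sqft
    2 | 3 | 1200
    1 | 2 | 800
    7 | 6 | 5000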
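
    To see how the column order passed to Concatenate relates to the resulting vector, here is a minimal sketch that reuses only the calls from the example above. The helper name firstVector and the single data row are made up for illustration. It passes the column names explicitly in two different orders; the expectation is that both vectors hold the same three values and only their slot positions differ.

    ```fsharp
    #r "nuget:Microsoft.ML"

    open Microsoft.ML
    open Microsoft.ML.Data

    let ctx = MLContext()

    // One illustrative row (values are made up for this sketch)
    let rows = seq { {| NumRooms = 3f; NumBaths = 2f; SqFt = 1200f |} }
    let data = ctx.Data.LoadFromEnumerable(rows)

    // Hypothetical helper: concatenate the given columns into "Features"
    // and return the feature vector of the first row
    let firstVector (columns: string[]) =
        let transformed =
            ctx.Transforms.Concatenate("Features", columns).Fit(data).Transform(data)
        transformed.GetColumn<float32[]>("Features") |> Seq.head

    // Same columns, two different orders
    let roomsFirst = firstVector [| "NumRooms"; "NumBaths"; "SqFt" |]
    let sqftFirst = firstVector [| "SqFt"; "NumRooms"; "NumBaths" |]

    printfn "%A" roomsFirst
    printfn "%A" sqftFirst
    ```

    Because the values are copied as-is rather than hashed or combined, the original column values can be read back from the vector as long as you know which slot each column went into.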
    

    If you're looking to understand the impact individual features have on your model, I'd suggest looking at Permutation Feature Importance (PFI) and Feature Contribution Calculation (FCC).