machine-learning, vectorization, octave, logistic-regression, multiclass-classification

Vectorizing label detection in a vector (dataset) in Octave for multi-class logistic regression


I am implementing logistic regression with multiple features and multiple classes; my data set has m (> 100) samples with class labels 1, 2, 3, 4 and 5. I tried to find the number of unique labels/classes and collect them in a vector. With Y as a column vector of size (m,1), I wrote the code below:

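m = size(Y, 1);          # number of samples (rows of Y)
classes = [Y(1,1)];      # initialize classes with the first label
for i = 2:m
    count = 0;           # how many known classes match Y(i)
    for j = 1:length(classes)
        if Y(i,1) == classes(j,1)
            count = count + 1;
        end
    end
    if count == 0        # label not seen before: append it
        classes = [classes; Y(i,1)];
    end
end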

This gave me the list of unique labels in the vector Y. However, I was wondering if there is a better way to write this code (the lines above appear childish to me), especially by vectorization. Any suggestions are welcome. Thanks.


Solution

  • If the purpose of the code is just to generate a list of the unique values in Y, you can simply use unique(Y). For example:

    >> m = 10;
    >> Y = floor(rand(m,1)*5+1)
    Y =
    
       5
       1
       5
       4
       2
       2
       1
       5
       1
       4
    
    >> unique(Y)
    ans =
    
       1
       2
       4
       5    
    

    Note that the output of your code lists the classes in the order they first appear in Y, e.g.:

    classes = 
    
       5
       1
       4
       2
    

    If that order is important, you'll need something like this:

    >> [sortedClasses idx] = unique(Y,"first")
    sortedClasses =
    
       1
       2
       4
       5
    
    idx =
    
       2
       5
       4
       1
    
    >> unsortedClasses = Y(sort(idx))
    unsortedClasses =
    
       5
       1
       4
       2
    

    Both unique and sort are fairly well vectorized for speed. Dropping the loop also avoids growing classes one element at a time, which forces repeated copying of the variable and would add significant overhead if you had a very large number of classes.
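
As a rough sanity check, the sketch below runs the vectorized approach next to a condensed version of the loop from the question, using the sample Y from the transcript above. The names classes_vec, classes_loop and num_classes are chosen here for illustration and do not appear in the original post.

```octave
# Sample labels copied from the transcript above.
Y = [5; 1; 5; 4; 2; 2; 1; 5; 1; 4];

# Vectorized: index of the first occurrence of each distinct value,
# then sort those indices to restore the order of first appearance.
[~, idx] = unique(Y, "first");
classes_vec = Y(sort(idx));

# Loop-based reference (condensed from the question), for comparison only.
classes_loop = Y(1);
for i = 2:numel(Y)
  if ~any(classes_loop == Y(i))        # label not seen before
    classes_loop = [classes_loop; Y(i)];
  end
end

# Both approaches should produce the same column vector.
assert(isequal(classes_vec, classes_loop));

# Number of distinct classes, which the question also asked for.
num_classes = numel(classes_vec);
```

For this Y, both vectors hold 5, 1, 4, 2 and num_classes is 4. If the order of first appearance does not matter, numel(unique(Y)) gives the class count directly.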