python, numpy, scikit-learn, svm

SVC in Python predicts the same value "1" for every C or gamma used


This is the code:

import numpy as np
from sklearn import svm
numere=np.fromfile("sat.trn",dtype=int,count=-1,sep=" ")
numereTest=np.fromfile("sat.tst",dtype=int,count=-1,sep=" ")
numere=numere.reshape(int(len(numere)/37),37)
numereTest=numereTest.reshape(int(len(numereTest)/37),37)
etichete=numere[0:int(len(numere)),36]
eticheteTest=numereTest[0:int(len(numereTest)),36]
numere=np.delete(numere,36,1)
numereTest=np.delete(numereTest,36,1)
clf=svm.SVC(kernel='rbf',C=1,gamma=1)
clf.fit(numere,etichete)
predictie=clf.predict(numereTest)

I loaded the data from the files and built two np.arrays from it, but the output is 1 no matter what I do.

numere(:10)-->array([[ 92, 115, 120, 94, 84, 102, 106, 79, 84, 102, 102, 83, 101, 126, 133, 103, 92, 112, 118, 85, 84, 103, 104, 81, 102, 126, 134, 104, 88, 121, 128, 100, 84, 107, 113, 87], [ 84, 102, 106, 79, 84, 102, 102, 83, 80, 102, 102, 79, 92, 112, 118, 85, 84, 103, 104, 81, 84, 99, 104, 78, 88, 121, 128, 100, 84, 107, 113, 87, 84, 99, 104, 79], [ 84, 102, 102, 83, 80, 102, 102, 79, 84, 94, 102, 79, 84, 103, 104, 81, 84, 99, 104, 78, 84, 99, 104, 81, 84, 107, 113, 87, 84, 99, 104, 79, 84, 99, 104, 79], [ 80, 102, 102, 79, 84, 94, 102, 79, 80, 94, 98, 76, 84, 99, 104, 78, 84, 99, 104, 81, 76, 99, 104, 81, 84, 99, 104, 79, 84, 99, 104, 79, 84, 103, 104, 79], [ 84, 94, 102, 79, 80, 94, 98, 76, 80, 102, 102, 79, 84, 99, 104, 81, 76, 99, 104, 81, 76, 99, 108, 85, 84, 99, 104, 79, 84, 103, 104, 79, 79, 107, 109, 87], [ 80, 94, 98, 76, 80, 102, 102, 79, 76, 102, 102, 79, 76, 99, 104, 81, 76, 99, 108, 85, 76, 103, 118, 88, 84, 103, 104, 79, 79, 107, 109, 87, 79, 107, 109, 87], [ 76, 102, 106, 83, 76, 102, 106, 87, 80, 98, 106, 79, 80, 107, 118, 88, 80, 112, 118, 88, 80, 107, 113, 85, 79, 107, 113, 87, 79, 103, 104, 83, 79, 103, 104, 79], [ 76, 102, 106, 87, 80, 98, 106, 79, 76, 94, 102, 76, 80, 112, 118, 88, 80, 107, 113, 85, 80, 95, 100, 78, 79, 103, 104, 83, 79, 103, 104, 79, 79, 95, 100, 79], [ 76, 89, 98, 76, 76, 94, 98, 76, 76, 98, 102, 72, 80, 95, 104, 74, 76, 91, 104, 74, 76, 95, 100, 78, 75, 91, 96, 75, 75, 91, 96, 71, 79, 87, 93, 71], [ 76, 94, 98, 76, 76, 98, 102, 72, 76, 94, 90, 76, 76, 91, 104, 74, 76, 95, 100, 78, 76, 91, 100, 74, 75, 91, 96, 71, 79, 87, 93, 71, 79, 87, 93, 67]])


Solution

  • The most likely reasons for what you get are:

    First, you do not scale the data. Try StandardScaler.

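    from sklearn.preprocessing import StandardScaler

    # Fit the scaler on the training data only, then apply it to both sets
    scaler = StandardScaler()
    scaler.fit(numere)
    numere = scaler.transform(numere)
    numereTest = scaler.transform(numereTest)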
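
A rough way to see why unscaled features with gamma=1 can give one constant prediction: the RBF kernel is exp(-gamma * ||x - y||^2), and on raw values in the 70 to 130 range the squared distance between two rows is huge. The sketch below uses only the first two rows from the numere[:10] printout in the question; the helper name `rbf` and the smaller gamma of 1e-4 are my own illustrative choices, and this is a plausible explanation of the symptom, not something checked against the full dataset.

```python
import math

# First two feature rows copied from the numere[:10] printout in the question.
row_a = [92, 115, 120, 94, 84, 102, 106, 79, 84, 102, 102, 83,
         101, 126, 133, 103, 92, 112, 118, 85, 84, 103, 104, 81,
         102, 126, 134, 104, 88, 121, 128, 100, 84, 107, 113, 87]
row_b = [84, 102, 106, 79, 84, 102, 102, 83, 80, 102, 102, 79,
         92, 112, 118, 85, 84, 103, 104, 81, 84, 99, 104, 78,
         88, 121, 128, 100, 84, 107, 113, 87, 84, 99, 104, 79]


def rbf(x, y, gamma):
    """RBF kernel value exp(-gamma * ||x - y||^2) for two equal-length vectors."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq)


# Squared Euclidean distance between the two raw (unscaled) rows.
sq_dist = sum((a - b) ** 2 for a, b in zip(row_a, row_b))

k_gamma_1 = rbf(row_a, row_b, gamma=1)         # the setting used in the question
k_gamma_small = rbf(row_a, row_b, gamma=1e-4)  # illustrative, much smaller gamma

print("squared distance:", sq_dist)
print("kernel with gamma=1:", k_gamma_1)
print("kernel with gamma=1e-4:", k_gamma_small)
```

With gamma=1 the kernel value underflows to zero, so each training point looks unrelated to every other point. When that holds between a test sample and all support vectors, the decision function reduces to the intercept and every sample gets the same label. Scaling shrinks the distances, and a smaller gamma does a similar job from the other side.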
    

    Second, you are not tuning your parameters. You need to select the best-fitting values, and I strongly recommend using grid search. You can find an example how to use it here. Grid search is good for parameter tuning, but take care not to use cross-validation on this dataset; that is a recommendation from its creators :) Gamma and C can range from very small decimal numbers to very large numbers, so you can't test them properly by hand.

    Edit: since you should not use CV, this is a better way for you to do the grid search:

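    from sklearn.model_selection import ParameterGrid

    grid = {  # extend this with more values
        'gamma': [0.001, 0.1, 10, 100, 1000],
        'C': [1, 10, 100]
    }

    # start below any possible accuracy so the first result is always kept
    best_score = -1.0
    best_grid = None

    for g in ParameterGrid(grid):
        clf.set_params(**g)
        clf.fit(numere, etichete)
        # keep the parameters with the best score on the test set
        score = clf.score(numereTest, eticheteTest)
        if score > best_score:
            best_score = score
            best_grid = g

    print("best score:", best_score)
    print("Grid:", best_grid)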
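
Since gamma and C span several orders of magnitude, one option for filling in the grid above with more values is to space them logarithmically instead of typing them by hand. This is only a sketch: the exponent ranges are arbitrary guesses of mine, not values tuned for this dataset, and `itertools.product` is used here just to show how many combinations the loop would have to fit.

```python
from itertools import product

# Hypothetical exponent ranges; widen or narrow them for your data.
gamma_values = [10.0 ** e for e in range(-4, 2)]  # 1e-4 up to 10
c_values = [10.0 ** e for e in range(-1, 4)]      # 0.1 up to 1000

grid = {'gamma': gamma_values, 'C': c_values}

# Every (gamma, C) pair a search over this grid would visit.
combinations = [{'gamma': g, 'C': c}
                for g, c in product(grid['gamma'], grid['C'])]

print(len(combinations), "parameter combinations")
```

The same `grid` dict can be passed to ParameterGrid(grid) in the loop above. Keep in mind that each extra value multiplies the number of fits.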