python numpy feature-extraction feature-selection

How to select feature sizes

I'm trying to replicate an experiment from a paper using an SVM, to improve my knowledge of machine learning. In this paper, the author extracts the features and chooses the feature sizes. He then shows a table where F represents the size of the feature vector and N represents the number of face images.

He then works with F >= 9 and N >= 15 parameters.

Now, what I want to do is extract the features the same way he does in the paper.

Basically, this is how I extract the features:

import os

import numpy as np
import skimage.io
from skimage.transform import resize
from sklearn.utils import Bunch

# DATADIR (dataset root folder) and CATEGORIES (list of class folder names)
# are assumed to be defined elsewhere in the script.

def load_image_files(fullpath, dimension=(64, 64)):
    descr = "An image classification dataset"
    images = []      # original, unprocessed images
    flat_data = []   # resized and flattened images, one row per image
    target = []      # class index for each image
    for category in CATEGORIES:
        path = os.path.join(DATADIR, category)
        for person in os.listdir(path):
            personfolder = os.path.join(path, person)
            for imgname in os.listdir(personfolder):
                class_num = CATEGORIES.index(category)
                fullpath = os.path.join(personfolder, imgname)
                img = skimage.io.imread(fullpath)  # read each file only once
                img_resized = resize(img, dimension, anti_aliasing=True, mode='reflect')
                flat_data.append(img_resized.flatten())
                images.append(img)
                target.append(class_num)

    flat_data = np.array(flat_data)
    target = np.array(target)
    images = np.array(images)
    print(CATEGORIES)

    return Bunch(data=flat_data,
                 target=target,
                 target_names=category,
                 images=images,
                 DESCR=descr)

How do I select the number of features that are extracted and stored? Or how do I manually build a feature vector with the number of features I need, for instance a feature vector of size 9?

I'm trying to separate my features this way:

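from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    image_dataset.data, image_dataset.target, test_size=0.3, random_state=109)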

model = ExtraTreesClassifier(n_estimators=10)
model.fit(X_train, y_train)
print(model.feature_importances_)

However, my output is:

[0. 0. 0. ... 0. 0. 0.]

For SVM classification, I'm trying to use OneVsRestClassifier:

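from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

model_to_set = OneVsRestClassifier(SVC(kernel="poly"))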

parameters = {
    "estimator__C": [1,2,4,8],
    "estimator__kernel": ["poly", "rbf"],
    "estimator__degree":[1, 2, 3, 4],
}

model_tunning = GridSearchCV(model_to_set, param_grid=parameters)
model_tunning

model_tunning.fit(X_train, y_train)

prediction = model_tunning.best_estimator_.predict(X_test)

Then, once I call prediction, I get:

Out[29]:
array([1, 0, 4, 2, 1, 3, 3, 0, 1, 1, 3, 4, 1, 1, 0, 3, 2, 2, 2, 0, 4, 2,
       2, 4])

Solution

So you've got two arrays of image information (one unprocessed, the other resized and flattened) as well as a list of corresponding class values (which we usually call labels). There are currently 2 things not quite right with the setup, however:

1) What's missing here are multiple features. These might be arrays derived from feature extraction on your images (morphological or other computer vision processing), or they may be ancillary data such as a list of preferences, behaviors, or purchases. Basically, anything that can be represented as an array in either numerical or categorical form. Technically speaking, your resized images are a second feature, but I don't think they will add much, if any, improvement in model performance.

2) target_names=category in your function's return statement stores only the last value that category took while looping over CATEGORIES, not the full list of names. I don't know if this is what you want.
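
To make point 2) concrete, here is a minimal sketch of what happens to a loop variable after the loop finishes. The category names are made up for illustration, and passing the whole list as target_names is only my guess at what was intended:

```python
# Hypothetical category names; the real CATEGORIES list is not shown in the question.
CATEGORIES = ["category_a", "category_b", "category_c"]

for category in CATEGORIES:
    pass  # the image loading would happen here

# After the loop, `category` still holds whatever it was on the final iteration.
target_names_as_written = category

# Assumption: the intent was to keep every class name, in index order.
target_names_probably_intended = list(CATEGORIES)

print(target_names_as_written)
print(target_names_probably_intended)
```

With the list version, target_names[class_num] maps a predicted class index back to its name, which a single string cannot do.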

Going back to your table, N would refer to the number of images in the dataset, and F would be the number of feature arrays associated with each image. By way of example, let's say we have fifty individual wines and five features (colour, taste, alcohol content, pH, optical density). N of 5 would be five of those wines, and F of 2 would be, say, colour and taste.
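
As a rough sketch of the wine example, N and F are just the two dimensions of a data matrix: rows are samples, columns are features. The values below are random placeholders, and I'm assuming colour and taste have already been encoded as numbers:

```python
import numpy as np

# Feature names from the wine example; the values are random placeholders.
feature_names = ["colour", "taste", "alcohol_content", "pH", "optical_density"]

rng = np.random.default_rng(0)
wines = rng.random((50, len(feature_names)))  # 50 wines (rows) x 5 features (columns)

# N = 5 and F = 2: keep the first five wines and the first two features.
n, f = 5, 2
subset = wines[:n, :f]

print(wines.shape)        # (50, 5)
print(subset.shape)       # (5, 2)
print(feature_names[:f])  # ['colour', 'taste']
```

Choosing N is picking rows, choosing F is picking (or computing) columns.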

If I had to guess at what your features would be, they would in fact be a single feature - the image data itself. Looking at your data structure, every label/category you have will have multiple individuals (people) each with multiple examples of images of that person. Note that multiple individuals are not separate features - the way you're structuring the data, the individuals are grouped together under a single category.
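
Here is a small sketch of the structure I'm describing, using random arrays in place of real images. The counts (2 categories, 2 people each, 3 images per person) and the grayscale assumption are mine, chosen only to keep the example small:

```python
import numpy as np

# Made-up dataset layout: category -> person -> images.
categories = ["category_a", "category_b"]  # hypothetical names
people_per_category = 2
images_per_person = 3
dimension = (64, 64)  # same resize target as in the question

rng = np.random.default_rng(0)
flat_data, target = [], []
for class_num, _category in enumerate(categories):
    for _person in range(people_per_category):
        for _image in range(images_per_person):
            img = rng.random(dimension)      # stands in for one resized grayscale image
            flat_data.append(img.flatten())  # 64 * 64 = 4096 pixel values
            target.append(class_num)         # the person is not recorded, only the category

X = np.array(flat_data)
y = np.array(target)

print(X.shape)  # (12, 4096): 12 images, each one long row of pixel values
print(y)        # six 0s followed by six 1s: people are merged within each category
```

So with this loader every image contributes one row of raw pixel values, and the label only distinguishes categories, not the individual people inside them.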

So, where to from here? Without knowing what paper you're reading it's hard to suggest what to do, but I would go back and see if you can perhaps provide us with more information about the problem.