Genetic algorithm for feature selection

I am doing project on heart disease prediction system. here am using "Cleveland Heart Disease Dataset" which contains 13 attributes

Sex
Chest Pain Type
Fasting Blood Sugar
Restecg – resting electrographic results
Exang – exercise induced angina
Slope - the slope of the peak exercise ST segment
CA – number of major vessels colored by fluorscopy
Thal
Trest Blood Pressure
Serum Cholesterol
Thalach – maximum heart rate achieved
Oldpeak – ST depression induced by exercise relative to rest
Age

I found a paper where they applied genetic algorithm for this purpose and selected the following attributes

Type - Chest Pain Type
Rbp - Resting blood pressure
Eia - Exercise induced angina
Oldpk - Old peak
Vsl - No. of vessels colored
Thal -Maximum heart rate achieved)

However, they didn't mention about the criteria they used to find the fittest attributes(fitness function). Since I am new to this concept I don't have any idea of how to perform the task. Can anyone help me?

Solution

Defining the population and its representation

The candidates (population of GA) are the different subsets of attributes. Each subset can be a good set of attributes related to hearth disease or not.

So I understand you have data with different measures for the attributes and an indicator of the measured person having hearth disease or not.

You can easily represent a subset of attributes using a bit for each atribute. So 10000000000000 will be the subset with only the first attribute. 11000... only the two first... and so on.

Find the fitness function

How to say if a candidate (subset of attributes) is a good or bad indicator for hearth disease. I would say it's good if its directly correlated with the disease. So for all the patients with high numbers on that indicators they have disease, and for all the patients with low numbers the don't have the disease.

TODO: find a correlation measure... :) (I'll edit the answer)

A subset with more indicators than necesary is bad. So you have to score worse if an attribute from the subset is NOT correlated.

TODO: find a way to introduce this.

Two directions

Also, I will have into account the two directions. By example an attribute can be related with hearth disease if it has a low number. So I will use 26 bits. Two bits for eachs indicator. One using the attribute value, and other the negative one.

Finding a fitness measure

With the statistical data you could tell if an arbitrary set of attributes is good for finding hearth disease or not.

Each patient will be first, second, and so on according to each attribute. By example blood pressure. The one with less pressure will be the first, the one with more pressure will be the last.

So if blood pressure is highly related, those with high values will have disease while those with low pressure will not have.

So a good score for a set of attributes is how many correct diagnoses you could do based on data you have. If you have attributes A and B, their score as good indicators will increase with the number of patients with high numbers and hearth disease (related), and will decrease with the number of patients with low numbers and hearth disease (unrelated or contradictory).

For an only attribute

I can order patients based on that attribute. Then I can see which of them have disease. If those with higher numbers (to the right of the ordering) have disease, then its related. Otherwise not.

If I obtain:

ND ND ND ND ND D D D D D D

ND = no disease
D = disease

It's very very related.

So the score for me will be how ordered is the ND/D value, after ordering the patients by their value on that attribute.

For a set of attributes

Of course you have to give a score for a set of attributes (let's say, the first three attributes of the list). So I should first order patients by each one of them:

Ordered by -> Attr1, Attr2, Attr3

Patient1       1st    3rd    10th
Patient2       2nd    11th   2nd
Patient3       6th    1st    3rd

And then sum the positions for each patient:

Ordered by -> Attr1, Attr2, Attr3

Patient1       1st    3rd    10th -> 1+3+10 = 14
Patient2       2nd    11th   2nd -> 2 + 11 + 2 = 15
Patient3       6th    1st    3rd -> 6+1+3 = 10

And then order the patients by that sum.

P3, P1, P2

Then if their disease status is highly ordered (those with disease are on the right), the the score is high.

By example:

ND ND D -> only patient 2 has disease, highly correlated
D D ND -> patients 3 and 1 has disease, doesn't seem correlated (in fact, it seems contradictory)

So the last part for defining an scoring method is find a way to say if a sequence of bits is ordered or not:

ND ND ND ND D D D D D D -> high score
D ND D ND D ND D ND D ND -> low score

Hope it helps! :)