I am doing project on heart disease prediction system. here am using "Cleveland Heart Disease Dataset" which contains 13 attributes
I found a paper where they applied genetic algorithm for this purpose and selected the following attributes
However, they didn't mention about the criteria they used to find the fittest attributes(fitness function). Since I am new to this concept I don't have any idea of how to perform the task. Can anyone help me?
Defining the population and its representation
The candidates (population of GA) are the different subsets of attributes. Each subset can be a good set of attributes related to hearth disease or not.
So I understand you have data with different measures for the attributes and an indicator of the measured person having hearth disease or not.
You can easily represent a subset of attributes using a bit for each atribute. So 10000000000000 will be the subset with only the first attribute. 11000... only the two first... and so on.
Find the fitness function
How to say if a candidate (subset of attributes) is a good or bad indicator for hearth disease. I would say it's good if its directly correlated with the disease. So for all the patients with high numbers on that indicators they have disease, and for all the patients with low numbers the don't have the disease.
TODO: find a correlation measure... :) (I'll edit the answer)
A subset with more indicators than necesary is bad. So you have to score worse if an attribute from the subset is NOT correlated.
TODO: find a way to introduce this.
Two directions
Also, I will have into account the two directions. By example an attribute can be related with hearth disease if it has a low number. So I will use 26 bits. Two bits for eachs indicator. One using the attribute value, and other the negative one.
Finding a fitness measure
With the statistical data you could tell if an arbitrary set of attributes is good for finding hearth disease or not.
Each patient will be first, second, and so on according to each attribute. By example blood pressure. The one with less pressure will be the first, the one with more pressure will be the last.
So if blood pressure is highly related, those with high values will have disease while those with low pressure will not have.
So a good score for a set of attributes is how many correct diagnoses you could do based on data you have. If you have attributes A and B, their score as good indicators will increase with the number of patients with high numbers and hearth disease (related), and will decrease with the number of patients with low numbers and hearth disease (unrelated or contradictory).
For an only attribute
I can order patients based on that attribute. Then I can see which of them have disease. If those with higher numbers (to the right of the ordering) have disease, then its related. Otherwise not.
If I obtain:
ND ND ND ND ND D D D D D D
ND = no disease
D = disease
It's very very related.
So the score for me will be how ordered is the ND/D value, after ordering the patients by their value on that attribute.
For a set of attributes
Of course you have to give a score for a set of attributes (let's say, the first three attributes of the list). So I should first order patients by each one of them:
Ordered by -> Attr1, Attr2, Attr3
Patient1 1st 3rd 10th
Patient2 2nd 11th 2nd
Patient3 6th 1st 3rd
And then sum the positions for each patient:
Ordered by -> Attr1, Attr2, Attr3
Patient1 1st 3rd 10th -> 1+3+10 = 14
Patient2 2nd 11th 2nd -> 2 + 11 + 2 = 15
Patient3 6th 1st 3rd -> 6+1+3 = 10
And then order the patients by that sum.
P3, P1, P2
Then if their disease status is highly ordered (those with disease are on the right), the the score is high.
By example:
ND ND D -> only patient 2 has disease, highly correlated
D D ND -> patients 3 and 1 has disease, doesn't seem correlated (in fact, it seems contradictory)
So the last part for defining an scoring method is find a way to say if a sequence of bits is ordered or not:
ND ND ND ND D D D D D D -> high score
D ND D ND D ND D ND D ND -> low score
Hope it helps! :)