Search code examples
javagenetic-algorithm

Genetic algorithm for feature selection


I am doing project on heart disease prediction system. here am using "Cleveland Heart Disease Dataset" which contains 13 attributes

  1. Sex
  2. Chest Pain Type
  3. Fasting Blood Sugar
  4. Restecg – resting electrographic results
  5. Exang – exercise induced angina
  6. Slope - the slope of the peak exercise ST segment
  7. CA – number of major vessels colored by fluorscopy
  8. Thal
  9. Trest Blood Pressure
  10. Serum Cholesterol
  11. Thalach – maximum heart rate achieved
  12. Oldpeak – ST depression induced by exercise relative to rest
  13. Age

I found a paper where they applied genetic algorithm for this purpose and selected the following attributes

  1. Type - Chest Pain Type
  2. Rbp - Resting blood pressure
  3. Eia - Exercise induced angina
  4. Oldpk - Old peak
  5. Vsl - No. of vessels colored
  6. Thal -Maximum heart rate achieved)

However, they didn't mention about the criteria they used to find the fittest attributes(fitness function). Since I am new to this concept I don't have any idea of how to perform the task. Can anyone help me?


Solution

  • Defining the population and its representation

    The candidates (population of GA) are the different subsets of attributes. Each subset can be a good set of attributes related to hearth disease or not.

    So I understand you have data with different measures for the attributes and an indicator of the measured person having hearth disease or not.

    You can easily represent a subset of attributes using a bit for each atribute. So 10000000000000 will be the subset with only the first attribute. 11000... only the two first... and so on.

    Find the fitness function

    How to say if a candidate (subset of attributes) is a good or bad indicator for hearth disease. I would say it's good if its directly correlated with the disease. So for all the patients with high numbers on that indicators they have disease, and for all the patients with low numbers the don't have the disease.

    TODO: find a correlation measure... :) (I'll edit the answer)
    

    A subset with more indicators than necesary is bad. So you have to score worse if an attribute from the subset is NOT correlated.

    TODO: find a way to introduce this.
    

    Two directions

    Also, I will have into account the two directions. By example an attribute can be related with hearth disease if it has a low number. So I will use 26 bits. Two bits for eachs indicator. One using the attribute value, and other the negative one.

    Finding a fitness measure

    With the statistical data you could tell if an arbitrary set of attributes is good for finding hearth disease or not.

    Each patient will be first, second, and so on according to each attribute. By example blood pressure. The one with less pressure will be the first, the one with more pressure will be the last.

    So if blood pressure is highly related, those with high values will have disease while those with low pressure will not have.

    So a good score for a set of attributes is how many correct diagnoses you could do based on data you have. If you have attributes A and B, their score as good indicators will increase with the number of patients with high numbers and hearth disease (related), and will decrease with the number of patients with low numbers and hearth disease (unrelated or contradictory).

    For an only attribute

    I can order patients based on that attribute. Then I can see which of them have disease. If those with higher numbers (to the right of the ordering) have disease, then its related. Otherwise not.

    If I obtain:

    ND ND ND ND ND D D D D D D
    
    ND = no disease
    D = disease
    

    It's very very related.

    So the score for me will be how ordered is the ND/D value, after ordering the patients by their value on that attribute.

    For a set of attributes

    Of course you have to give a score for a set of attributes (let's say, the first three attributes of the list). So I should first order patients by each one of them:

    Ordered by -> Attr1, Attr2, Attr3
    
    Patient1       1st    3rd    10th
    Patient2       2nd    11th   2nd
    Patient3       6th    1st    3rd
    

    And then sum the positions for each patient:

    Ordered by -> Attr1, Attr2, Attr3
    
    Patient1       1st    3rd    10th -> 1+3+10 = 14
    Patient2       2nd    11th   2nd -> 2 + 11 + 2 = 15
    Patient3       6th    1st    3rd -> 6+1+3 = 10
    

    And then order the patients by that sum.

    P3, P1, P2
    

    Then if their disease status is highly ordered (those with disease are on the right), the the score is high.

    By example:

    ND ND D -> only patient 2 has disease, highly correlated
    D D ND -> patients 3 and 1 has disease, doesn't seem correlated (in fact, it seems contradictory)
    

    So the last part for defining an scoring method is find a way to say if a sequence of bits is ordered or not:

    ND ND ND ND D D D D D D -> high score
    D ND D ND D ND D ND D ND -> low score
    

    Hope it helps! :)