artificial-intelligence probability genetic-algorithm genetic-programming

Using genetic programming to estimate probability

I would like to use a genetic program (gp) to estimate the probability of an 'outcome' from an 'event'. To train the nn I am using a genetic algorithm.

So, in my database I have many events, with each event containing many possible outcomes.

I will give the gp a set of input variables that relate to each outcome in each event.

My questions is - what should the fitness function be in the gp be ????

For instance, right now I am giving the gp a set of input data (outcome input variables), and a set of target data (1 if outcome DID occur, 0 if outcome DIDN'T occur, with the fitness function being the mean squared error of the outputs and targets). I then take the sum of each output for each outcome, and divide each output by the sum (to give the probability). However, I know for sure that this is not the right way to be doing this.

For clarity, this is how I am CURRENTLY doing this:

I would like to estimate the probability of 5 different outcomes occurring in an event:

Outcome 1 - inputs = [0.1, 0.2, 0.1, 0.4] 
Outcome 1 - inputs = [0.1, 0.3, 0.1, 0.3] 
Outcome 1 - inputs = [0.5, 0.6, 0.2, 0.1] 
Outcome 1 - inputs = [0.9, 0.2, 0.1, 0.3] 
Outcome 1 - inputs = [0.9, 0.2, 0.9, 0.2]

I will then calculate the gp output for each input:

Outcome 1 - output = 0.1 
Outcome 1 - output = 0.7 
Outcome 1 - output = 0.2 
Outcome 1 - output = 0.4 
Outcome 1 - output = 0.4

The sum of the outputs for each outcome in this event would be: 1.80. I would then calculate the 'probability' of each outcome by dividing the output by the sum:

Outcome 1 - p = 0.055 
Outcome 1 - p = 0.388 
Outcome 1 - p = 0.111 
Outcome 1 - p = 0.222 
Outcome 1 - p = 0.222

Before you start - I know that these aren't real probabilities, and that this approach does not work !! I just put this here to help you understand what I am trying to achieve.

Can anyone give me some pointers on how I can estimate the probability of each outcome ? (also, please note my maths is not great)

Many thanks

Solution

I understand the first part of your question: What you described is a classification problem. You're learning if your inputs relate to whether an outcome was observed (1) or not (0).

There are difficulties with the second part though. If I understand you correctly you take the raw GP output for a certain row of inputs (e.g. 0.7) and treat it as a probability. You said this doesn't work, obviously. In GP you can do classification by introducing a threshold value that splits your classes. If it's bigger than say 0.3 the outcome should be 1 if it's smaller it should be 0. This threshold isn't necessarily 0.5 (again it's just a number, not a probability).

I think if you want to obtain a probability you should attempt to learn multiple models that all explain your classification problem well. I don't expect you have a perfect model that explains your data perfectly, respectively if you have you wouldn't want a probability anyway. You can bag these models together (create an ensemble) and for each outcome you can observe how many models predicted 1 and how many models predicted 0. The amount of models that predicted 1 divided by the number of models could then be interpreted as a probability that this outcome will be observed. If the models are all equally good then you can forget weighing between them, if they're different in quality of course you could factor these into your decision. Models with less quality on their training set are less likely to contribute to a good estimate.

So in summary you should attempt to apply GP e.g. 10 times and then use all 10 models on the training set to calculate their estimate (0 or 1). However, don't force yourself to GP only, there are many classification algorithms that can give good results.

As a sidenote, I'm part of the development team of a software called HeuristicLab which runs under Windows and with which you can run GP and create such ensembles. The software is open source.