I have a bit of a problem with my project for the university.
I have to implement document classification using genetic algorithm.
I've had a look at this example and (lets say) understood the principles of the genetic algorithms but I'm not sure how they can be implemented in document classification. Can't figure out the fitness function.
Here is what I've managed to think of so far (Its probably totally wrong...)
Accept that I have the categories and each category is described by some keywords.
Split the file to words.
Create first population from arrays (100 arrays for example but it will depends on the size of the file) filled with random words from the file.
1:
Choose the best category for each child in the population (by counting the keywords in it).
Crossover each 2 children in the population (new array containing half of each children) - "crossover"
Fill the rest of the children left from the crossover with random not used words from the file - "evolution??"
Replace random words in random child from the new population with random word from the file (used or not) - "mutation"
Copy the best results to the new population.
Go to 1 until some population limit is reached or some category is found enough times
I'm not sure if this is correct and will be happy to have some advices, guys.
Much appreciate it!
Ivane, in order to properly apply GA's to document classification:
So the steps that you've described are on the right track, but I'll give you some improvements:
So what you want to do is:
prevValidationFitness = default;
currentValidationFitness = default;
bestGA = default;
while(currentValidationFitness.IsBetterThan( prevValidationFitness ) )
{
prevValidationFitness = currentValidationFitness;
// Randomly generate a population of GAs
population[] = randomlyGenerateGAs();
// Train your population on the training data set
bestGA = Train(population);
// Get the validation fitness fitness of the best GA
currentValidationFitness = Validate(bestGA);
// Make your selection (i.e. half of the population, roulette wheel selection, or random selection)
selection[] = makeSelection(population);
// Mate the specimens in the selection (each mating involves a crossover and possibly a mutation)
population = mate(selection);
}
Whenever you get get a new document (one which has not been classified before), you can now classify it with your best GA:
category = bestGA.Classify(document);
So this is not the end-all-be-all solution, but it should give you a decent start. Pozdravi, Kiril