Search code examples
algorithmgenetic-algorithmdocument-classification

Document clasification, using genetic algorithms


I have a bit of a problem with my project for the university.

I have to implement document classification using genetic algorithm.

I've had a look at this example and (lets say) understood the principles of the genetic algorithms but I'm not sure how they can be implemented in document classification. Can't figure out the fitness function.

Here is what I've managed to think of so far (Its probably totally wrong...)

Accept that I have the categories and each category is described by some keywords.
Split the file to words.
Create first population from arrays (100 arrays for example but it will depends on the size of the file) filled with random words from the file.
1:
Choose the best category for each child in the population (by counting the keywords in it).
Crossover each 2 children in the population (new array containing half of each children) - "crossover"
Fill the rest of the children left from the crossover with random not used words from the file - "evolution??"
Replace random words in random child from the new population with random word from the file (used or not) - "mutation"
Copy the best results to the new population.
Go to 1 until some population limit is reached or some category is found enough times

I'm not sure if this is correct and will be happy to have some advices, guys.
Much appreciate it!


Solution

  • Ivane, in order to properly apply GA's to document classification:

    1. You have to reduce the problem to a system of components that can be evolved.
    2. You can't do GA training for document classification on a single document.

    So the steps that you've described are on the right track, but I'll give you some improvements:

    • Have a sufficient amount of training data: you need a set of documents which are already classified and are diverse enough to cover the range of documents which you're likely to encounter.
    • Train your GA to correctly classify a subset of those documents, aka the Training Data Set.
    • At each generation, test your best specimen against a Validation Data Set and stop training if the validation accuracy starts to decrease.

    So what you want to do is:

    prevValidationFitness = default;
    currentValidationFitness = default;
    bestGA = default;
    
    while(currentValidationFitness.IsBetterThan( prevValidationFitness ) )
    {
        prevValidationFitness = currentValidationFitness;
    
        // Randomly generate a population of GAs
        population[] = randomlyGenerateGAs();
    
        // Train your population on the training data set
        bestGA = Train(population);
    
        // Get the validation fitness fitness of the best GA 
        currentValidationFitness = Validate(bestGA);
    
        // Make your selection (i.e. half of the population, roulette wheel selection, or random selection)
        selection[] = makeSelection(population);
    
        // Mate the specimens in the selection (each mating involves a crossover and possibly a mutation)
        population = mate(selection);
    }
    

    Whenever you get get a new document (one which has not been classified before), you can now classify it with your best GA:

    category = bestGA.Classify(document);
    

    So this is not the end-all-be-all solution, but it should give you a decent start. Pozdravi, Kiril