python machine-learning computer-vision image-recognition viola-jones

AdaBoost and Viola Jones: What training set to use?


I have implemented my own version of the Viola-Jones face detection algorithm, which uses AdaBoost as a meta-algorithm for building a classification committee. My aim is to build a classifier that can tell whether there is a human face in an image. I am struggling to find an adequate set of training data to try out the algorithm. In particular, I don't know where to find a set of negative images (i.e. images that do not contain a face). For the positive dataset I was going to try the Labeled Faces in the Wild (LFW) dataset.

What would be a good negative dataset?


Solution

  • Some solutions that might work for your problem are:

    • After some looking around, this resource seems to have a non-faces dataset.

    • Another dataset you could consider is the Google "things" dataset, found here (description).

    • Something different you might consider is building your own dataset. If you are going to use the LFW dataset, whose images are tightly cropped around the face, you could take a database of zoomed-out photos, with and without people, run a standard face detection algorithm on it to find where the faces are, and then crop out face-sized sections, both where the cropped region contains a face and where it does not. Some datasets, such as VGG Face, provide face images together with their bounding boxes. You might consider using something like this.

    • You could also use any dataset that you know has no faces in it, so long as it depicts scenes that your algorithm might run into. For example, the CIFAR-10 and CIFAR-100 sets contain many outdoor scenes, including some close-ups of animal faces, which might make good negatives for your algorithm. You can find them here. Another is the ImageNet set.

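    Take care when selecting the negative dataset, as it can easily introduce bias. For example, if all negatives are outdoor scenes and all positives are indoor portraits, the classifier may learn to separate indoor from outdoor images rather than faces from non-faces.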
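
    If you go the build-your-own route, the cropping step could look roughly like the sketch below. It is only an illustration under some assumptions of mine: images are NumPy arrays, face boxes are given as (x, y, w, h) tuples (from annotations or from whatever detector you run), and the 24-pixel window and the names `iou` and `sample_negative_patches` are hypothetical choices, not part of any library. It draws random face-sized windows and keeps only those that do not overlap a known face box.

```python
import numpy as np


def iou(box_a, box_b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def sample_negative_patches(image, face_boxes, patch_size=24, n_patches=10,
                            max_iou=0.0, max_tries=1000, seed=0):
    """Crop random square windows that do not overlap any face box.

    image      : 2-D (grayscale) or 3-D (H, W, C) NumPy array
    face_boxes : list of (x, y, w, h) boxes; assumed to come from dataset
                 annotations or an off-the-shelf detector
    max_iou    : largest allowed overlap with a face box; 0.0 rejects any
                 window that touches a face at all
    """
    rng = np.random.default_rng(seed)
    height, width = image.shape[:2]
    if height < patch_size or width < patch_size:
        return []  # image too small to hold a single window

    patches = []
    for _ in range(max_tries):
        if len(patches) == n_patches:
            break
        # Top-left corner chosen so the window stays inside the image.
        x = int(rng.integers(0, width - patch_size + 1))
        y = int(rng.integers(0, height - patch_size + 1))
        candidate = (x, y, patch_size, patch_size)
        if all(iou(candidate, box) <= max_iou for box in face_boxes):
            patches.append(image[y:y + patch_size, x:x + patch_size].copy())
    return patches


# Toy usage: a blank image with one bright square standing in for a face.
image = np.zeros((100, 100), dtype=np.uint8)
image[30:60, 30:60] = 255
negatives = sample_negative_patches(image, [(30, 30, 30, 30)], n_patches=5)
```

    The function may return fewer than `n_patches` windows if faces cover most of the image and `max_tries` runs out. Sampling positions uniformly at random is just the simplest option; it does nothing by itself about the bias issue mentioned above, so the source photos still need to resemble what the detector will see.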