image-processing crop face-detection feature-detection

State of the Art solutions for automated thumbnail cropping using detection of faces and other features

I'm looking for a state-of-the-art method to crop images to thumbnails while keeping all relevant features of the images. The images are stills from TV shows and movies. They are large (more than 1000px), sharp and usually very well balanced (hue, saturation). It doesn't matter whether this happens in real time or not.

Solution

This question is quite ill-posed in the sense that it depends entirely on what you mean by "all relevant features".

I assume relevant features in a TV show or movie might be 1) Faces, 2) People, 3) Logos, or 4) Anything a human might find interesting/salient.

1) Faces. You could run a face detector such as the one built into OpenCV. This uses the Viola/Jones Haar cascade technique to find faces in an image and will return a set of boxes around those faces. You can then crop the frame to a region containing only those boxes. It's not state of the art, but it's the most common face detector, used for example in the face finders built into camera hardware.
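The cropping step could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: it assumes the detector hands back boxes as (x, y, w, h) tuples (the format OpenCV's `CascadeClassifier.detectMultiScale` returns), and the function name `crop_around_boxes` and its `aspect` and `margin` parameters are hypothetical. It takes the union of the boxes, pads it, grows it to the thumbnail aspect ratio and shifts it back inside the frame.

```python
def crop_around_boxes(boxes, frame_w, frame_h, aspect=1.0, margin=0.2):
    """Return an (x, y, w, h) crop around all boxes, or None if there are none.

    boxes   : iterable of (x, y, w, h) detections (assumed input format)
    aspect  : desired width / height of the thumbnail
    margin  : padding around the union, as a fraction of its size per side
    """
    boxes = list(boxes)
    if not boxes:
        return None

    # Union of all detections.
    x0 = min(x for x, y, w, h in boxes)
    y0 = min(y for x, y, w, h in boxes)
    x1 = max(x + w for x, y, w, h in boxes)
    y1 = max(y + h for x, y, w, h in boxes)

    # Pad the union and remember its centre.
    w = (x1 - x0) * (1 + 2 * margin)
    h = (y1 - y0) * (1 + 2 * margin)
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2

    # Grow the shorter side until the crop has the requested aspect ratio.
    if w / h < aspect:
        w = h * aspect
    else:
        h = w / aspect

    # If the crop is larger than the frame, shrink it (this may cut into boxes).
    scale = min(1.0, frame_w / w, frame_h / h)
    w, h = w * scale, h * scale

    # Centre on the detections, then shift so the crop stays inside the frame.
    x = min(max(cx - w / 2, 0), frame_w - w)
    y = min(max(cy - h / 2, 0), frame_h - h)
    return int(round(x)), int(round(y)), int(round(w)), int(round(h))
```

For two face boxes in a 1920x1080 still, `crop_around_boxes([(400, 200, 100, 100), (600, 250, 100, 100)], 1920, 1080)` gives a square crop that contains both faces; the resulting region can then be resized to the thumbnail size.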

2) People. To detect people you could use a standard pedestrian detector (e.g. the Dalal and Triggs HOG/SVM method, see their CVPR 2005 paper). This is not state of the art but will probably do a reasonable job, and there are plenty of works and implementations derived from that kind of framework available on the web, e.g. search for "INRIA pedestrian detector".

An alternative would be to use the Oxford VGG's upper body/torso detector, which is also a reasonable predictor for people in an image and was, I believe, trained on the TV show Buffy the Vampire Slayer.

3) Logos. Use a SIFT detector and a Bag of Visual Words framework with an SVM to find these robustly. You can google various papers from Andrew Zisserman's group (Oxford) or Gabriela Csurka's group (XRCE Grenoble), e.g. "Video Google", to find out more about these methods, and it is fairly trivial to implement them in OpenCV with its built-in feature detectors. A Bag of Visual Words approach would be sufficient here, but a Fisher Vector based approach would probably be considered closer to state of the art.
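To make the Bag of Visual Words idea concrete, here is a minimal sketch of its quantisation step only. It assumes the local descriptors (e.g. SIFT vectors) and a codebook of visual words (e.g. k-means cluster centres) have already been computed elsewhere; the function name `bovw_histogram` is hypothetical. Each descriptor is assigned to its nearest visual word, and the normalised counts form the feature vector that would be passed to the SVM.

```python
def bovw_histogram(descriptors, codebook):
    """Quantise descriptors against a codebook; return an L1-normalised histogram.

    descriptors : list of equal-length numeric vectors (assumed precomputed)
    codebook    : list of visual words, same dimensionality as the descriptors
    """
    counts = [0] * len(codebook)
    for d in descriptors:
        # Index of the nearest visual word by squared Euclidean distance.
        nearest = min(
            range(len(codebook)),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(d, codebook[k])),
        )
        counts[nearest] += 1

    total = sum(counts)
    if total == 0:
        # No descriptors found in the image: return an all-zero histogram.
        return [0.0] * len(codebook)
    return [c / total for c in counts]
```

With a toy two-word codebook `[[0, 0], [10, 10]]` and four 2-D descriptors, two near each word, the histogram comes out evenly split. Real SIFT descriptors are 128-dimensional and codebooks typically hold hundreds or thousands of words, so a vectorised implementation would be preferable in practice.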

4) "Anything salient". For some decades computer vision researchers have attempted to design generic "anything interesting" detectors for general imagery, but in my opinion no-one has approached a usable solution for the kind of context (any TV show or movie) you state. If you want to try something mid-range (again not state of the art, but something likely to have code freely available on the web) you could google the Itti-Koch saliency method.
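Whichever method produces the saliency map, turning it into a crop could look something like the sketch below. Note that this does not implement the Itti-Koch method itself: it assumes a saliency map is already available as a 2-D grid of non-negative scores, and the function name `best_window` is hypothetical. It uses a summed-area table to find the fixed-size window with the largest total saliency.

```python
def best_window(saliency, win_w, win_h):
    """Return (x, y), the top-left corner of the most salient win_w x win_h window.

    saliency : 2-D list (rows x columns) of non-negative scores (assumed input)
    """
    rows, cols = len(saliency), len(saliency[0])
    if win_w > cols or win_h > rows:
        raise ValueError("window is larger than the saliency map")

    # Summed-area table with a zero border:
    # sat[r][c] holds the sum of saliency over rows < r and columns < c.
    sat = [[0.0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        row_sum = 0.0
        for c in range(cols):
            row_sum += saliency[r][c]
            sat[r + 1][c + 1] = sat[r][c + 1] + row_sum

    best_total, best_xy = float("-inf"), (0, 0)
    for y in range(rows - win_h + 1):
        for x in range(cols - win_w + 1):
            # Window sum from four table lookups.
            total = (sat[y + win_h][x + win_w] - sat[y][x + win_w]
                     - sat[y + win_h][x] + sat[y][x])
            if total > best_total:
                best_total, best_xy = total, (x, y)
    return best_xy
```

On a small map whose high scores sit in one corner, the returned window covers that corner. For full-resolution stills it would be sensible to downsample the saliency map first and scale the resulting coordinates back up.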