deep-learning computer-vision object-detection

Is it possible to perform Object Detection without proper training data?

For my task, I am given a series of frames extracted from a Tom and Jerry video. I need to detect the objects in each frame (in my case, Tom and Jerry) and their locations. Since these classes are not among the ImageNet classes, I am stuck with no training data.

I have searched extensively, and there seem to be some tools where I need to manually crop the objects out of the images. Is there any way to do this without such manual work?

Any suggestions would be really helpful, thanks a lot!

Solution

"Is there any way to do this without such manual work?"

Welcome to the current state of machine learning, driven by data-hungry networks and a lot of labor on the dataset-creation side :) Labels are here to stay for a while; they are how you tell your network (via the loss function) what you want it to do. But you are not in such a bad situation at all, because you can take a pre-trained net and just fine-tune it on your lovely Tom and Jerry (acquiring the training data will take 1-2 hours).

So what is fine-tuning and how does it work? Let's say you take a net pre-trained on ImageNet, which performs reasonably well on the classes defined in ImageNet. This is your starting point. The network has already learned quite abstract features from all those ImageNet objects, which is why it is capable of transfer learning from a reasonably small number of samples of new classes. Now, when you add Tom and Jerry to the network output and fine-tune it on a small amount of data (20-100 samples), it will not perform that badly (my guess is that accuracy will be somewhere around 65-85%). So here is what I suggest:

Google for a pre-trained net that is easy to work with. I found this. See chapter 4, Transfer Learning with Your Own Image Dataset.
Pick a labeling tool.
Label 20-100 Toms and Jerries with bounding boxes. For a small dataset like this, divide it into ./train (80%) and ./test (20%). Try to capture different poses, different backgrounds, and frames that are distinct from each other. Apply some augmentation.
Remove the last network layer and add a layer with 2 new outputs, Tom and Jerry.
Train it (fine-tune it) and check the accuracy on your test set.
Have fun! Train it again with more data.
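
To make the 80/20 split and the "check accuracy" step a bit more concrete, here is a small sketch in plain Python. It is only an illustration: the file names, the label tuples and the helper names `split_dataset` and `iou` are made up for this example, and intersection over union (IoU) is just one common way to judge whether a predicted box matches a labeled one.

```python
import random


def split_dataset(samples, train_fraction=0.8, seed=0):
    """Shuffle the labeled samples and split them into (train, test)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    inter = max(0, ix_max - ix_min) * max(0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical labels: (frame file, class name, bounding box).
labels = [
    (f"frame_{i:03d}.jpg", "tom" if i % 2 else "jerry", (10, 10, 60, 80))
    for i in range(50)
]
train, test = split_dataset(labels)
print(len(train), len(test))
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))
```

A prediction is often counted as correct when its IoU with the labeled box is above some threshold such as 0.5, but that threshold is a convention you are free to change.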

"Is it possible to perform Object Detection without proper training data?"

It kind of is, but I can't imagine anything simpler than fine-tuning. We can talk here about:

A. Non-machine-learning approaches: classic computer vision with hand-crafted features and manually defined parameters, used as a detector. In your case this is probably not the way you want to go; however, a sliding box combined with manual color-histogram thresholding may work for Tom and Jerry (the threshold parameter could naturally be learned as well). This is quite often more work than the proposed fine-tuning. Sometimes it is a way to label thousands of samples, then correct the labels, and then train a more powerful detector. There are numerous tasks for which this approach is enough, and its benefits can be a lightweight model and speed.
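
A rough sketch of the sliding-box plus color-thresholding idea, in plain Python with the frame stored as nested lists of RGB tuples. Everything specific here is an assumption for illustration: the color range standing in for Tom's grey-blue fur, the synthetic 12x12 frame, and the helper names `color_fraction` and `detect`. On real frames the range and window size would need tuning by hand.

```python
def color_fraction(image, x, y, w, h, lo, hi):
    """Fraction of pixels in the window whose RGB values lie within [lo, hi]."""
    hits = 0
    for row in image[y:y + h]:
        for pixel in row[x:x + w]:
            if all(l <= c <= u for c, l, u in zip(pixel, lo, hi)):
                hits += 1
    return hits / (w * h)


def detect(image, win_w, win_h, lo, hi, stride=1, min_fraction=0.5):
    """Slide a window over the image; return the best (x, y, w, h) or None."""
    height, width = len(image), len(image[0])
    best_score, best_box = 0.0, None
    for y in range(0, height - win_h + 1, stride):
        for x in range(0, width - win_w + 1, stride):
            score = color_fraction(image, x, y, win_w, win_h, lo, hi)
            if score > best_score:
                best_score, best_box = score, (x, y, win_w, win_h)
    # Reject the best window if too few of its pixels match the color range.
    return best_box if best_score >= min_fraction else None


# Synthetic frame: white background with a 4x4 grey-blue patch at x=5, y=3.
frame = [[(255, 255, 255)] * 12 for _ in range(12)]
for yy in range(3, 7):
    for xx in range(5, 9):
        frame[yy][xx] = (120, 130, 160)

# Made-up color range for "Tom"; not measured from real frames.
TOM_LO, TOM_HI = (90, 100, 120), (150, 160, 190)
box = detect(frame, 4, 4, TOM_LO, TOM_HI)
print(box)
```

Boxes found this way can also serve as rough first-pass labels that you correct by hand before training a stronger detector.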

B. Machine learning approaches that deal with no proper training data, or that deal with a small amount of data, as humans do. This is mainly an emerging field, currently under active R&D, and a few of my favorites are:

fine-tuning pre-trained nets. Hey, we are using this because it is so simple!
one-shot approaches, like triplet loss + deep metric learning
memory-augmented neural networks used in a one-shot/few-shot context
unsupervised and semi-supervised approaches
bio-plausible nets, including no-backprop approaches with only the last layer tuned via supervision
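
To give a feel for the one-shot item in the list above, here is a tiny sketch of a triplet loss and of nearest-reference classification in an embedding space. The 2-D embeddings are invented numbers; in practice they would come from a trained network, and the function names are only for this example. This is one common form of the loss (Euclidean distance with a margin), not the only one.

```python
import math


def distance(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the positive is closer to the anchor than the negative by `margin`."""
    return max(0.0, distance(anchor, positive) - distance(anchor, negative) + margin)


def classify_one_shot(embedding, references):
    """Return the class whose single reference embedding is nearest."""
    return min(references, key=lambda name: distance(embedding, references[name]))


# One made-up reference embedding per class (the "one shot").
references = {"tom": (0.9, 0.1), "jerry": (0.1, 0.8)}
query = (0.8, 0.2)  # hypothetical embedding of a new crop
print(classify_one_shot(query, references))
print(triplet_loss((0.0, 0.0), (3.0, 4.0), (0.0, 0.0)))
```

Training with such a loss pulls crops of the same character together and pushes different characters apart, so that a single labeled example per class can be enough at test time.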