I have a great many folders, each containing a large number of image files. Occasionally a scanned document image ends up in one of them by accident, and unless someone visually checks the folder it goes undetected, which could cause problems if it is published to the wrong location.
Because the scans can be saved in any file format and their file sizes are broadly in the same range as the genuine images, they are very hard to detect from metadata alone.
Does anyone know of a way to distinguish a scanned document from a genuine image, either with a tool or programmatically?
I would recommend taking a look at the Accord.NET Framework: http://accord-framework.net/. Check out its computer vision features. I think it should be up to the task you are describing, and it is a fun new area to learn. Good luck.
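
As a rough illustration of the kind of check a computer vision approach could start from, here is a minimal sketch in Python. It is not Accord code (Accord is a .NET library), and it is only one possible heuristic: it assumes scanned documents are mostly bright, nearly colorless "paper" pixels, while genuine photos are not. The function names (`page_like_score`, `looks_like_scan`, `load_pixels`) and the thresholds (`white_level`, `max_chroma`, `threshold`) are made up for this example and would need tuning on your own folders. Loading files assumes the Pillow package is installed.

```python
def page_like_score(pixels, white_level=200, max_chroma=30):
    """Return the fraction of pixels that are bright and nearly colorless.

    `pixels` is any iterable of (r, g, b) tuples with values 0-255.
    The default thresholds are illustrative guesses, not tuned values.
    """
    total = 0
    paper = 0
    for r, g, b in pixels:
        total += 1
        bright = min(r, g, b) >= white_level
        grayish = max(r, g, b) - min(r, g, b) <= max_chroma
        if bright and grayish:
            paper += 1
    return paper / total if total else 0.0


def looks_like_scan(pixels, threshold=0.6):
    """Flag an image as a probable scan if most of it looks like paper."""
    return page_like_score(pixels) >= threshold


def load_pixels(path, size=(128, 128)):
    """Read an image file as a list of (r, g, b) tuples, downscaled for speed.

    Assumes Pillow is available; imported here so the scoring
    functions above work without it.
    """
    from PIL import Image

    with Image.open(path) as img:
        img = img.convert("RGB")   # normalise palettes, grayscale, alpha, etc.
        img.thumbnail(size)        # shrink in place, keeping aspect ratio
        data = img.tobytes()       # raw bytes: R, G, B, R, G, B, ...
    return list(zip(data[0::3], data[1::3], data[2::3]))
```

You could then walk each folder, call `looks_like_scan(load_pixels(path))` on every file, and review only the flagged ones by hand. Expect false positives (snow scenes, product shots on a white background) and false negatives (scans of dark or colored pages), so treat it as a filter to reduce manual checking rather than a final answer.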