c# windows image-processing ms-office system.drawing

Using C#, how do I search for images in a Windows file system like TinEye.com does on the web?


Hi and thanks for looking!

Update

For the sake of clarity, a third-party .NET library is just fine. Preferably an open-source or free one. The solution need not be native .NET.

Background

I am working on an enterprise web application for which the client has given us thousands of pages of content in MS Word documents. We have to parse these documents, extract the data, and send it to the content database.

Within these docs are various embedded images, each representing a larger original image stored in a separate folder.

The client did not provide any paths to the original source images, so when we see content with an embedded image in an MS Word doc, we have to go through several "assets" folders and look for the corresponding image, which is extraordinarily time consuming.

We are already using DocX to parse the documents, so you can assume that we have a list of bitmap images to loop through that we have pulled from the document.

Question

Given a list of bitmaps that we just extracted from the document, how do we search a different folder containing hundreds of images for the matching image and return its file path?

TinEye.com does this over the web. I am wondering if, using System.Drawing or something, we can do it on a PC with C#.

Thanks!

Matt


Solution

  • Hate to propose an answer to my own question, but I think I might be on to something here. Here is a heuristic, in pseudocode, for a C# forms app. Your thoughts are appreciated:

    Part 1

    1. Using System.IO, traverse the "assets" folders and get all images.
    2. For each image, Base64 encode it.
    3. Take the resulting string and place it in an XML file:
    <Image>
         <Path>C:\SomePath</Path>
         <EncodedString>[Some Base64 String]</EncodedString>
    </Image>
    

    Now we have an XML file containing all original images, in Base64 form, along with their file path.
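
Part 1 could be sketched roughly as follows. This is a minimal illustration, not the code actually used: the class name `ImageIndexBuilder`, the `<Images>` root element, and the list of file extensions are all assumptions made for the example. It also reads every file fully into memory, which is only reasonable for a modest number of assets.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Xml.Linq;

public static class ImageIndexBuilder
{
    // Assumed set of extensions to treat as images; adjust to the assets at hand.
    private static readonly string[] ImageExtensions =
        { ".png", ".jpg", ".jpeg", ".gif", ".bmp" };

    // Walks rootFolder recursively and emits one <Image> element per image file,
    // holding the file path and the Base64 encoding of the file's bytes.
    public static XDocument BuildIndex(string rootFolder)
    {
        var images = Directory
            .EnumerateFiles(rootFolder, "*", SearchOption.AllDirectories)
            .Where(path => ImageExtensions.Contains(
                Path.GetExtension(path).ToLowerInvariant()))
            .Select(path => new XElement("Image",
                new XElement("Path", path),
                new XElement("EncodedString",
                    Convert.ToBase64String(File.ReadAllBytes(path)))));

        // <Images> is a hypothetical root wrapping the <Image> entries.
        return new XDocument(new XElement("Images", images));
    }
}
```

A call such as `ImageIndexBuilder.BuildIndex(@"C:\SomePath").Save("index.xml");` would then write the index to disk.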

    Part 2

    1. Using DocX, extract all images from the MS Word doc.
    2. For each image, Base64 encode it and use LINQ to XML to search for an exact match in the XML file from Part 1.
    3. If there are no exact matches, start iterating the XML file and computing the Levenshtein distance (LD) between the extracted image's Base64 string and each stored string.
    4. While in the foreach, store the XML node Id (or file path) and the Levenshtein distance as a key/value pair in an object.
    5. Take the k/v pair with the lowest LD score and return the file path.
    6. For performance, set a tolerance so that the foreach stops as soon as an original image has an acceptably low LD score when compared to the image extracted from the document.
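
The matching steps above might look something like the sketch below. The names `ImageMatcher`, `FindBestMatch`, and `tolerance` are hypothetical, and the distance function is a plain textbook Levenshtein implementation rather than any particular library's. It keeps a running minimum instead of storing every key/value pair, which gives the same result as steps 4 and 5. Note that plain Levenshtein takes time proportional to the product of the two string lengths, so it will be slow on Base64 strings of large images.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class ImageMatcher
{
    // Textbook two-row Levenshtein distance between strings a and b.
    public static int Levenshtein(string a, string b)
    {
        var previous = new int[b.Length + 1];
        var current = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) previous[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            current[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(current[j - 1] + 1, previous[j] + 1),
                    previous[j - 1] + cost);
            }
            // The row just filled becomes the previous row for the next pass.
            var swap = previous; previous = current; current = swap;
        }
        return previous[b.Length];
    }

    // Returns the path of the indexed image whose Base64 string is closest
    // to 'encoded', or null if the index has no <Image> entries.
    public static string FindBestMatch(XDocument index, string encoded, int tolerance)
    {
        var entries = index.Descendants("Image").ToList();

        // Step 2: look for an exact match first.
        var exact = entries.FirstOrDefault(
            e => (string)e.Element("EncodedString") == encoded);
        if (exact != null) return (string)exact.Element("Path");

        // Steps 3-6: track the lowest distance, stopping early once it is
        // within the tolerance.
        string bestPath = null;
        int bestDistance = int.MaxValue;
        foreach (var entry in entries)
        {
            int distance = Levenshtein(encoded, (string)entry.Element("EncodedString"));
            if (distance < bestDistance)
            {
                bestDistance = distance;
                bestPath = (string)entry.Element("Path");
            }
            if (bestDistance <= tolerance) break;
        }
        return bestPath;
    }
}
```

Here `encoded` would be the Base64 string of an image pulled from the Word document, for example `ImageMatcher.FindBestMatch(XDocument.Load("index.xml"), encoded, 50)`, where the tolerance of 50 is an arbitrary placeholder.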

    Since this is a one-off task, I don't need instant performance. So, I could run this tonight before leaving the office and, hopefully, come back tomorrow to a list of paths connecting the original images to the ones embedded in the docs.

    UPDATE

    The heuristic above worked beautifully! I ended up using the Sift library to efficiently calculate distances between Base64 strings. Specifically, I used their FastDistance() method. I am getting 100% accuracy on finding the images I need, even when the angle from which the photo was taken is slightly different.