Hi and thanks for looking!
For the sake of clarity, a third-party .NET library is just fine. Preferably an open-source or free one. The solution need not be native .NET.
I am working on an enterprise web application for which the client has given us thousands of pages of content in MS Word documents that we have to parse, extract data, and send to the content database.
Within these docs are various embedded images representing a larger original image in a separate folder.
The client did not provide any paths to the original source image, so when we see content with an embedded image in the MS Word doc, we have to go through several "assets" folders and look for the corresponding image which is extraordinarily time consuming.
We are already using DocX to parse the documents, so you can assume that we have a list of bitmap images to loop through that we have pulled from the document.
Given a list of bitmaps that we just extracted from the document, how do we search a different folder containing hundreds of images, for the matching image, and then return the file path to it?
TinEye.com does this over the web. I am wondering if, using System.Drawing or something, we can do it on a PC with C#.
Thanks!
Matt
Hate to propose an answer to my own question, but I think I might be on to something here. Here is heuristic/pseudo code for a C# forms app--your thoughts are appreciated:
<Image> <Path>C:\SomePath</Path> <EncodedString>[Some Base64 String]<Encoded String> </Image>
Now we have an XML file containing all original images, in Base64 form, along with their file path.
foreach
store the XML node Id (or file path) and Levenshtein Distance as a key value pair in an object.foreach
stops if a certain original image has an acceptably low LD score when compared to the image extracted from the document.Since this is a one-off task, I don't need instant performance. So, I could run this tonight before leaving the office and, hopefully, come back tomorrow to a list of paths connecting the original images to the ones embedded in the docs.
The heuristic above worked beautifully! I ended up using the Sift library to efficiently calculate distances between Base64 strings. Specifically, I used their FastDistance() method. Having 100% accuracy on finding the images I need, even if the angle from which the photo was taken is slightly different.