I'm wondering if it would be possible to create dense vector representations for an image, similar to how you might create a word embedding with an algorithm like Word2Vec?
I understand that there are some big differences between text and image data – specifically the fact that word2vec uses the word's context to train – but I'm hoping to find a similar counterpart for images.
If a simplistic example of w2v (from Allison Parrish's GitHub Gist) is:
|           | cuteness (0-100) | size (0-100) |
|-----------|------------------|--------------|
| kitten    | 95               | 15           |
| tarantula | 8                | 3            |
| panda     | 75               | 40           |
| mosquito  | 1                | 1            |
| elephant  | 65               | 90           |
And another example being king - man + woman = queen
Is there some analog (or way of creating some type of analog) for images where you might get something generally along these lines (with some made-up numbers):
|                            | amount of people in image (0-100) | abstract-ness (0-100) |
|----------------------------|-----------------------------------|-----------------------|
| Starry Night               | 0                                 | 75                    |
| Mona Lisa                  | 1                                 | 9                     |
| American Gothic            | 2                                 | 7                     |
| Garden of Earthly Delights | 80                                | 50                    |
| Les Demoiselles d'Avignon  | 5                                 | 87                    |
(And just to clarify, I know the actual vectors created by an algorithm like Word2Vec wouldn't cleanly fit into human-interpretable categories; I just wanted to give an analogy to the Word2Vec example.)
or (starry night) - (landscape) + (man) = (van Gogh self portrait)
or = (abstract self portrait)
or something generally along those lines.
Those might not be the best examples, but to recap: I'm looking for some sort of algorithm for creating an abstract n-dimensional learned representation of an image that can be grouped or compared with vectors representing other images.
Thanks for your help!
Absolutely! But...
Such models tend to need significantly larger and deeper neural networks to learn the representations.
Word2vec uses a very shallow network, and its training goal is a simple prediction of neighboring words, often from a tightly limited vocabulary; as a beneficial side effect, it throws off compact vectors for each word.
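For concreteness, here's a minimal sketch of that word-level training using the gensim library (my choice for illustration; the toy corpus and parameters are made up, and a corpus this small won't yield meaningful vectors):

```python
# Toy word2vec training with gensim (gensim 4.x API; older versions use
# `size=` instead of `vector_size=`). Each "sentence" is a list of tokens.
from gensim.models import Word2Vec

sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "man", "walks", "the", "dog"],
    ["the", "woman", "walks", "the", "dog"],
]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=100)

print(model.wv["king"])  # the 50-dimensional dense vector learned for "king"
# The classic analogy query: words whose vectors are near (king - man + woman)
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"]))
```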
Image-centric algorithms instead try to solve labeling/classification tasks, or to regenerate original images under compressed-representation (or adversarial-classifier) constraints. They use 'convolutions' or other multi-layer constructs to interpret the much larger space of possible pixel values, and some of the interim neural-network layers can be interpreted as compact vectors for the input images.
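As a rough sketch of how you'd pull such an interim-layer vector out in practice: take a pretrained image classifier and chop off its final classification layer. I'm using torchvision's ResNet-18 here purely as one possible choice (not something specific to the idea above), and the filename is a hypothetical placeholder.

```python
# Sketch: extract a 512-dimensional embedding from a pretrained ResNet-18 by
# replacing its classification head with an identity layer (torchvision >= 0.13
# API shown; older versions use `pretrained=True` instead of `weights=`).
import torch
import torchvision
from torchvision import transforms
from PIL import Image

model = torchvision.models.resnet18(weights="IMAGENET1K_V1")
model.fc = torch.nn.Identity()  # drop the final classification layer
model.eval()

# Standard ImageNet preprocessing
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

img = Image.open("starry_night.jpg").convert("RGB")  # hypothetical image file
with torch.no_grad():
    vec = model(preprocess(img).unsqueeze(0)).squeeze(0)

print(vec.shape)  # torch.Size([512]) -- a dense vector summarizing the image
```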
Note that even in textual word2vec, the individual "dense embedding" dimensions, learned in an unsupervised fashion, don't have neat human-interpretability (like "bigness", "cuteness", etc.). Often, certain directions/neighborhoods of the high-dimensional space are vaguely interpretable, but they're neither precise nor aligned exactly with the major dimension axes.
Similarly, any compact representations from deep neural network image modeling won't inherently have individual dimensions with clear meanings (unless specific extra constraints with those goals were designed in). But again, certain directions/neighborhoods of the high-dimensional space tend to be meaningful ("crowds", "a car", "smiles", etc.).
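Once you have such vectors, "meaningful directions/neighborhoods" cash out as plain geometry: nearby vectors are similar images, and you can even attempt the arithmetic-style queries from your question. A small sketch (the vector names and `library` dict are hypothetical placeholders for embeddings you'd compute yourself, e.g. with the ResNet sketch above):

```python
import numpy as np

def cosine_similarity(a, b):
    """Similarity in [-1, 1]; closer to 1 means more similar embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest(query_vec, library):
    """Return the name in `library` (a dict of name -> vector) whose vector is closest to query_vec."""
    return max(library, key=lambda name: cosine_similarity(query_vec, library[name]))

# Hypothetical usage, with vectors you'd compute for each image:
# query = vec_starry_night - vec_landscape + vec_man
# print(nearest(query, {"mona_lisa": vec_mona_lisa, "self_portrait": vec_self_portrait}))
```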
A nice overview I just found of some key papers in deep-learning-based image analysis (the kinds of algorithms that throw off compact & meaningful vector summaries of images) is at:
https://adeshpande3.github.io/The-9-Deep-Learning-Papers-You-Need-To-Know-About.html