
Dimensionality reduction - Pyspark


My objective is to find the visual similarity between various double-byte characters when written in a particular font. For instance,

I want to ascertain whether 伊 looks more similar to 達 or to 市. This exercise has to be done for 13,108 characters.

To solve this, we converted all the characters into grey-scale images using the draw library in Python. We then passed each character image through VGG-16 and took the output of its last convolutional layer as a feature set. That output has 512×7×7 = 25,088 elements per character. We collated all of these into one file, so we now have roughly 13,108 rows and 25,088 columns, and my aim is to run clustering on them to find optical similarity among all the characters. To do so, I first have to reduce the number of variables (columns).

What would be the optimal way to do this, and roughly how many variables (columns) should I expect to retain for the final model?


Solution

  • I suggest using an autoencoder neural network, whose objective is to reconstruct its input at the output. Such a network has encoder layers that progressively reduce the dimensionality, a bottleneck layer, and decoder layers that reconstruct the input from the bottleneck layer's activations.

    You can use the bottleneck layer's activations as your new variables (columns) and then cluster on them to find optical similarity among the characters. A big advantage of this approach is that, unlike linear dimensionality-reduction methods such as PCA, an autoencoder performs non-linear transformations, which usually leads to better representations.
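    To make the idea concrete, here is a minimal NumPy-only sketch of a single-hidden-layer autoencoder trained by plain gradient descent. The data shape (200 rows × 64 columns) and the bottleneck size of 8 are illustrative placeholders standing in for your 13,108 × 25,088 feature matrix; in practice you would use a deep-learning framework and a deeper, wider network.

    ```python
    import numpy as np

    # Toy data standing in for the real VGG-16 feature matrix.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 64))

    n_in, n_hidden = X.shape[1], 8          # 8 = illustrative bottleneck size
    W_enc = rng.normal(scale=0.1, size=(n_in, n_hidden))
    b_enc = np.zeros(n_hidden)
    W_dec = rng.normal(scale=0.1, size=(n_hidden, n_in))
    b_dec = np.zeros(n_in)

    lr = 0.01
    losses = []
    for epoch in range(200):
        # Forward pass: encode to the bottleneck, then decode back.
        H = np.tanh(X @ W_enc + b_enc)      # bottleneck activations
        X_hat = H @ W_dec + b_dec           # linear reconstruction
        err = X_hat - X
        losses.append((err ** 2).mean())    # mean squared reconstruction error

        # Backward pass (gradients of the MSE loss).
        grad_W_dec = H.T @ err / len(X)
        grad_b_dec = err.mean(axis=0)
        dH = (err @ W_dec.T) * (1 - H ** 2)  # tanh derivative
        grad_W_enc = X.T @ dH / len(X)
        grad_b_enc = dH.mean(axis=0)

        W_dec -= lr * grad_W_dec
        b_dec -= lr * grad_b_dec
        W_enc -= lr * grad_W_enc
        b_enc -= lr * grad_b_enc

    # The bottleneck activations are the reduced columns to cluster on.
    codes = np.tanh(X @ W_enc + b_enc)
    print(codes.shape)  # (200, 8)
    ```

    In your case, `codes` would be a 13,108 × k matrix (k = bottleneck size) that you feed to your clustering step instead of the raw 25,088 columns.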