I would like to create an AI to detect text in an image containing a string of one or more characters from a list of 100 possible characters. I would like the AI to output a string containing what it predicts the text is, as well as a confidence value for each character in it.
My problem is that I can't figure out how to have a variable number of outputs.
I have figured out how to detect a single character by computing a confidence value for each possible character, but I would like it to do this for each character in an image containing multiple characters.
The only solution I can think of is to have it return a binary string for each letter, the same length as the input string, where each 1 represents that character being present at that position. However, I don't know how to get a confidence value from this.
What I have:
This is the AI I currently have. It takes an image containing a single character and outputs a confidence value for each possible character.
Example:
Input:
An image containing the letter B
Output:
Letter | Confidence |
---|---|
A | 0.05 |
B | 0.80 |
C | 0.05 |
etc...
What I want:
I would like it to take an image containing multiple characters and output a confidence value for each possible character for each character in the image.
Example 1: (3 character string)
Input:
An image containing the string ABC
Output:
Letter | Character 1 | Character 2 | Character 3 |
---|---|---|---|
A | 0.80 | 0.05 | 0.05 |
B | 0.05 | 0.80 | 0.05 |
C | 0.05 | 0.05 | 0.80 |
etc...
Example 2: (2 Character string)
Input:
An image containing the string BA
Output:
Letter | Character 1 | Character 2 |
---|---|---|
A | 0.05 | 0.85 |
B | 0.80 | 0.05 |
C | 0.05 | 0.05 |
etc...
Notice how the length of the output array changes depending on the input.
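The variable-length output described above can be represented as a stack of per-position probability vectors: an (N, 100) array with one softmax row per character, where N changes with the input. A minimal numpy sketch (using a hypothetical 3-symbol alphabet for illustration; yours would have 100 entries):

```python
import numpy as np

# Hypothetical alphabet with 3 symbols for illustration.
ALPHABET = ["A", "B", "C"]

def to_prediction(char_probs):
    """char_probs: (num_chars, len(ALPHABET)) array, one softmax
    row per detected character. Returns the predicted string and
    one confidence value per character in it."""
    string = "".join(ALPHABET[i] for i in char_probs.argmax(axis=1))
    confidence = char_probs.max(axis=1)
    return string, confidence

# Rows correspond to character positions 1..3 (Example 1 above).
probs = np.array([[0.80, 0.05, 0.15],
                  [0.05, 0.80, 0.15],
                  [0.05, 0.15, 0.80]])
string, confidence = to_prediction(probs)
```

The point is that the network itself never needs a variable number of outputs; it is run once per character position, and the rows are stacked afterwards.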
This is my first time creating an AI in Python, so I haven't really used scikit-learn or Keras. I'm fine with using either; however, my current solution uses Keras.
You need to separate your problem into two steps:
Step 1: Find all characters.
Step 2: Crop each character from step one and predict which kind of character it is.
Note: There are approaches that combine everything into one step (the YOLO architecture can possibly do that); however, since step 2 is already solved and you are a beginner, it might be easier to understand and debug if you keep the steps separate for the moment.
You could use a very simple U-Net architecture to find the positions of the characters; here it is explained for Keras. The result is a heatmap with high values in areas containing a character. These heatmaps can be viewed as images, which makes them easy to interpret. You can then find the peaks (= the centers of the characters), cut out the area around each peak (e.g. 64x64 px), and feed each crop into your step-2 network. Peaks are easy to find: the skimage function peak_local_max lets you define a threshold as well as the minimum distance to the border and to other peaks. Note that your step-2 network expects images of a fixed size (e.g. 64x64 px), so when cropping characters close to the border you will need to pad with zeros to reach that size.
Edit: The training data for the step-1 network is a black image with white circles at the centers of your characters. The training data for the step-2 network is a centered, fixed-size character image with a one-hot-encoded label of which character it is. Here it might be useful to add an additional class for "not a character" to correct mistakes that the step-1 network makes.
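Generating the step-1 targets is straightforward with numpy; a sketch (the circle radius is an assumption you would tune to your character size):

```python
import numpy as np

def make_heatmap_target(shape, centers, radius=6):
    """Step-1 training target: a black image with a white disc
    drawn at each character center."""
    target = np.zeros(shape, dtype=np.float32)
    rr, cc = np.mgrid[0:shape[0], 0:shape[1]]  # per-pixel coordinates
    for r, c in centers:
        # Set every pixel within `radius` of the center to 1.
        target[(rr - r) ** 2 + (cc - c) ** 2 <= radius ** 2] = 1.0
    return target
```

The U-Net is then trained to map the input image to this target, and the one-hot labels for step 2 come directly from which character was rendered in each crop.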