tensorflow keras deep-learning yolo object-recognition

Is it possible to train YOLO (any version) for a single class on images of text (to find the regions containing equations)?

I am wondering if YOLO (any version, especially one tuned for accuracy rather than speed) can be trained on text data. What I am trying to do is find the regions in a text image where an equation is present.

For example, I want to find the two gray regions of interest in this image so that I can outline them and eventually crop the equations separately.

I am asking this question because: first, I have not found any place where YOLO is used for text data. Second, how can the input be customised for low-resolution images instead of the default (416, 416), given that all my images are either cropped or horizontal, mostly in a W = 2H format?

I have implemented the YOLOv3 version for text data, but using OpenCV, which basically runs on the CPU. I want to train the model from scratch.

Please help. Any of Keras, TensorFlow or PyTorch would do.

Here is the code I used for the OpenCV implementation.

import cv2
import numpy as np

net = cv2.dnn.readNet(PATH + "yolov3.weights", PATH + "yolov3.cfg")  # build the model. NOTE: this runs on the CPU by default
layer_names = net.getLayerNames()  # names of all the layers in the network (254 layers)
output_layers = net.getUnconnectedOutLayersNames()  # names of the 3 output layers (works across OpenCV versions)


height, width = img.shape[:2]  # img is assumed to be a BGR image, e.g. loaded with cv2.imread
blob = cv2.dnn.blobFromImage(image=img, scalefactor=0.00392, size=(416, 416), mean=(0, 0, 0), swapRB=True)
# output is a NumPy array of shape (1, 3, 416, 416). If you need to change the size, change it in the config file too
# swap BGR to RGB, scale pixel values by ~1/255 into [0, 1], resize, and subtract a mean of 0 from every channel

net.setInput(blob) 

outs = net.forward(output_layers)  # list of 3 arrays, one per output scale; each row is one candidate detection

class_ids = []  # class id of each kept box
confidences = []  # best class score of each kept box
boxes = []  # kept boxes as [x, y, w, h] in pixels

for out in outs:  # go through the output scales one by one
    for detection in out:  # each row is [cx, cy, w, h, objectness, 80 class scores]
        scores = detection[5:]  # scores for the 80 classes

        class_id = np.argmax(scores)  # index of the highest-scoring class
        confidence = scores[class_id]
        if confidence > 0.1:  # keep only boxes whose best class score exceeds 0.1

            # box centre and size, converted from normalised values to pixels
            center_x = int(detection[0] * width)  # centre x of the box
            center_y = int(detection[1] * height)  # centre y of the box
            w = int(detection[2] * width)  # box width
            h = int(detection[3] * height)  # box height

            # top-left corner of the rectangle
            x = int(center_x - w / 2)
            y = int(center_y - h / 2)

            boxes.append([x, y, w, h])  # collect the bounding boxes
            confidences.append(float(confidence))  # collect the confidence scores
            class_ids.append(class_id)  # collect the class ids
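
The loop above usually leaves several overlapping boxes for the same region, and nothing has been cropped yet. A minimal sketch of that remaining step, assuming the `[x, y, w, h]` pixel layout used above: greedy non-maximum suppression keeps the highest-scoring box among heavily overlapping ones, and each kept box is clipped to the image before slicing. The helper names `iou_xywh`, `nms` and `clamp_box` and the 0.4 overlap threshold are illustrative choices, not OpenCV API; OpenCV's own `cv2.dnn.NMSBoxes` could be used in place of the hand-written `nms`.

```python
def iou_xywh(a, b):
    """Intersection over union of two [x, y, w, h] boxes."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    inter_w = max(0, min(ax2, bx2) - max(a[0], b[0]))
    inter_h = max(0, min(ay2, by2) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, confidences, iou_threshold=0.4):
    """Return indices of the boxes kept by greedy non-maximum suppression."""
    # Visit boxes from highest to lowest confidence.
    order = sorted(range(len(boxes)), key=lambda i: confidences[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou_xywh(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


def clamp_box(box, width, height):
    """Clip an [x, y, w, h] box to the image and return (x1, y1, x2, y2)."""
    x, y, w, h = box
    return max(0, x), max(0, y), min(width, x + w), min(height, y + h)
```

With the lists built above, each region could then be cropped with `x1, y1, x2, y2 = clamp_box(boxes[i], width, height)` followed by `img[y1:y2, x1:x2]` for every `i` in `nms(boxes, confidences)`.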

Solution

Being an object detector, YOLO can only be used to detect specific text, not any text that might be present in the image.

For example, YOLO can be trained to do text-based logo detection like this:

"I want to find the 2 of the Gray regions of interest in this image so that I can outline and eventually, crop the equations separately."

Your problem statement is about detecting any equation (math formula) present in the image, so it can't be done using YOLO alone. I think Mathpix is similar to your use case. They are probably using an OCR (Optical Character Recognition) system trained and fine-tuned for their use case.

Eventually, to do something like Mathpix, what you need is an OCR system customised for your use case. There won't be any ready-made solution out there for this. You'll have to build one.

Proposed Methods:

Note: Tesseract can't be used as it is, because it is a pre-trained model trained to read any character. You can refer to the 2nd paper to train Tesseract to fit your use case.

To get some idea about OCR, you can read about it here.

EDIT:

So the idea is to build your own OCR that detects whatever constitutes an equation/math formula rather than detecting every character. You need a dataset where equations are marked. Basically, you look for regions with math symbols (say summation, integration, etc.).
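
To make the "look for regions with math symbols" idea concrete, here is a rough sketch, assuming you already have the recognised text and bounding box of each line (the `(text, box)` input format is hypothetical). It flags lines in which a large share of the characters are math symbols. The symbol set and the 0.15 threshold are arbitrary starting points, and a trained model would replace this heuristic in a real system.

```python
# Illustrative symbol set; extend it for your own documents.
MATH_SYMBOLS = set("=+-*/^<>∑∫∏√±×÷≤≥≠≈∞∂∇∈")


def math_density(text):
    """Fraction of non-space characters that are math symbols."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c in MATH_SYMBOLS for c in chars) / len(chars)


def equation_lines(lines, threshold=0.15):
    """Keep the (text, box) pairs whose symbol density meets the threshold."""
    return [(text, box) for text, box in lines if math_density(text) >= threshold]
```

For instance, calling `equation_lines` on a list of `(text, (x, y, w, h))` pairs returns only the lines that look like formulas, and their boxes can then be outlined or cropped.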

Some Tutorials to train your own OCR:

So the idea is that you follow these tutorials to learn how to train and build an OCR for any use case, and then read the research papers I mentioned above, along with the basic ideas I gave, to build an OCR for your own use case.