ocr google-cloud-vision google-vision text-segmentation

Understanding DetectedBreak in Google OCR full-text annotations

I am trying to convert the full-text annotations of a Google Vision OCR result, which are organized in a Block, Paragraph, Word and Symbol hierarchy, to line-level and word-level text.

However, when converting symbols to word text and words to line text, I need to understand the DetectedBreak property.

I went through this documentation, but I did not understand a few of the break types.

Can somebody explain what the following breaks mean? I only understood LINE_BREAK and SPACE.

EOL_SURE_SPACE
HYPHEN
LINE_BREAK
SPACE
SURE_SPACE
UNKNOWN

Can they be replaced by either a newline character or a space?

Solution

The link you provided has the most detailed explanation available of what each of these stands for. I suppose the best way to get a better understanding is to run OCR on different images and compare the response with what you see in the corresponding image. The following Python script runs DOCUMENT_TEXT_DETECTION on an image stored in GCS and prints every detected break other than the ones you already understand (LINE_BREAK and SPACE), along with the word immediately preceding it, so you can compare.

import sys
import os
# Written against google-cloud-vision releases before 2.0, which expose vision.types and vision.enums.
from google.cloud import vision

def detect_breaks(gcs_image):

    os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/path/to/json'
    client = vision.ImageAnnotatorClient()

    feature = vision.types.Feature(
        type=vision.enums.Feature.Type.DOCUMENT_TEXT_DETECTION)

    image_source = vision.types.ImageSource(
        image_uri=gcs_image)

    image = vision.types.Image(
        source=image_source)

    request = vision.types.AnnotateImageRequest(
        features=[feature], image=image)

    annotation = client.annotate_image(request).full_text_annotation

    breaks = vision.enums.TextAnnotation.DetectedBreak.BreakType
    word_text = ""
    for page in annotation.pages:
        for block in page.blocks:
            for paragraph in block.paragraphs:
                for word in paragraph.words:
                    for symbol in word.symbols:
                        word_text += symbol.text
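                        break_type = symbol.property.detected_break.type
                        # The default value 0 (UNKNOWN) is falsy, so symbols with no detected break are skipped.
                        if break_type:
                            # Report every break except SPACE and LINE_BREAK, with the word that precedes it.
                            if break_type not in (breaks.SPACE, breaks.LINE_BREAK):
                                print(word_text, symbol.property.detected_break)
                            # Any break ends the current word, so start collecting the next one.
                            word_text = ""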

if __name__ == '__main__':
    detect_breaks(sys.argv[1])
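
As for whether the breaks can be replaced by a newline or a space: going by the short descriptions in the BreakType reference (SURE_SPACE is a very wide space, EOL_SURE_SPACE is a line-wrapping break, HYPHEN is an end-of-line hyphen that is not present in the text, LINE_BREAK ends a paragraph), something along the following lines may work. This is only a sketch, not the library's own behaviour. The SEPARATORS table, join_symbols and to_lines are hypothetical names, the choice of separators is my assumption, and the input is plain (symbol_text, break_name) pairs rather than real API objects, so it runs without credentials. It also assumes every break follows its symbol and ignores the is_prefix flag.

```python
# Hypothetical mapping from BreakType names to the text emitted after a symbol.
# The separators are an assumption based on the enum's short descriptions.
SEPARATORS = {
    "UNKNOWN": "",           # no usable break information
    "SPACE": " ",            # regular space
    "SURE_SPACE": " ",       # very wide space, e.g. a gap between columns
    "EOL_SURE_SPACE": "\n",  # line-wrapping break
    "HYPHEN": "-\n",         # end-of-line hyphen that is not part of symbol.text
    "LINE_BREAK": "\n",      # break that ends a paragraph
}


def join_symbols(symbols):
    """Concatenate (symbol_text, break_name) pairs; break_name may be None."""
    parts = []
    for text, break_name in symbols:
        parts.append(text)
        if break_name:
            # Unrecognised names fall back to the empty string.
            parts.append(SEPARATORS.get(break_name, ""))
    return "".join(parts)


def to_lines(symbols):
    """Split the reconstructed text into line-level strings."""
    return join_symbols(symbols).splitlines()


if __name__ == "__main__":
    # Made-up sample data standing in for symbols read from an annotation.
    sample = [
        ("H", None), ("i", "SPACE"),
        ("a", None), ("l", "SURE_SPACE"),
        ("c", None), ("o", "HYPHEN"),
        ("o", None), ("p", "LINE_BREAK"),
    ]
    for line in to_lines(sample):
        print(line)
```

With real results you would pass the break type's name and symbol.text for each symbol in the loop above. If you would rather rejoin hyphenated words than preserve the visual lines, map HYPHEN to an empty string instead.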