ocr, google-cloud-vision, google-vision, text-segmentation

Understanding DetectedBreak in Google OCR full text annotations


I am trying to convert the full-text annotations of a Google Vision OCR result, which come in a Block, Paragraph, Word and Symbol hierarchy, into line-level and word-level text.

However, when converting symbols to word text and word to line text, I need to understand the DetectedBreak property.

I went through this documentation, but I did not understand a few of the break types.

Can somebody explain what the following breaks mean? I only understood LINE_BREAK and SPACE.

  1. EOL_SURE_SPACE
  2. HYPHEN
  3. LINE_BREAK
  4. SPACE
  5. SURE_SPACE
  6. UNKNOWN

Can they be replaced by either a newline character or a space?


Solution

  • The documentation you linked has the most detailed explanation available of what each of these stands for. The best way to get a better understanding is probably to run OCR on different images and compare the response with what you see in the corresponding image. The following Python script runs DOCUMENT_TEXT_DETECTION on an image stored in GCS and prints every detected break other than the two you already understand (LINE_BREAK and SPACE), along with the text immediately preceding it, to make the comparison easier.

    import os
    import sys

    from google.cloud import vision


    def detect_breaks(gcs_image):
        # Uses the google-cloud-vision 2.x+ names; releases before 2.0 exposed
        # these as vision.types.* / vision.enums.* with `type` instead of `type_`.
        os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/path/to/json'
        client = vision.ImageAnnotatorClient()

        feature = vision.Feature(
            type_=vision.Feature.Type.DOCUMENT_TEXT_DETECTION)
        image = vision.Image(
            source=vision.ImageSource(image_uri=gcs_image))
        request = vision.AnnotateImageRequest(
            features=[feature], image=image)

        annotation = client.annotate_image(request).full_text_annotation

        breaks = vision.TextAnnotation.DetectedBreak.BreakType
        word_text = ""
        for page in annotation.pages:
            for block in page.blocks:
                for paragraph in block.paragraphs:
                    for word in paragraph.words:
                        for symbol in word.symbols:
                            word_text += symbol.text
                            detected_break = symbol.property.detected_break
                            # UNKNOWN is 0, so symbols without a break are skipped.
                            if detected_break.type_:
                                # Print only the less obvious break types.
                                if detected_break.type_ not in (breaks.SPACE, breaks.LINE_BREAK):
                                    print(word_text, detected_break)
                                # Any break ends the current run of text.
                                word_text = ""


    if __name__ == '__main__':
        detect_breaks(sys.argv[1])
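
On the question of whether the breaks can be replaced by a newline or a space: the API reference describes SURE_SPACE as a very wide space, EOL_SURE_SPACE as a line-wrapping break, and HYPHEN as an end-of-line hyphen that is not present in the symbol text. The following is a minimal sketch of one possible mapping, not an official recipe. The `BreakType` enum here is a local stand-in that mirrors the documented numeric values so the example runs without the client library; `SEPARATORS`, `join_symbols` and `to_lines` are hypothetical names, and emitting `"-\n"` for HYPHEN is an assumption that you want lines as they appear in the image.

```python
from enum import IntEnum


class BreakType(IntEnum):
    """Local stand-in mirroring TextAnnotation.DetectedBreak.BreakType values."""
    UNKNOWN = 0
    SPACE = 1
    SURE_SPACE = 2
    EOL_SURE_SPACE = 3
    HYPHEN = 4
    LINE_BREAK = 5


# Assumed mapping from break type to the text emitted after the symbol.
SEPARATORS = {
    BreakType.SPACE: " ",
    BreakType.SURE_SPACE: " ",       # very wide space, still just a space here
    BreakType.EOL_SURE_SPACE: "\n",  # line-wrapping break
    BreakType.LINE_BREAK: "\n",      # break that ends a paragraph
    BreakType.HYPHEN: "-\n",         # hyphen is not in the symbol text itself
}


def join_symbols(symbols):
    """Join (text, break_type) pairs, given in reading order, into one string."""
    parts = []
    for text, break_type in symbols:
        parts.append(text)
        # UNKNOWN (no break) and unrecognised values add no separator.
        parts.append(SEPARATORS.get(break_type, ""))
    return "".join(parts)


def to_lines(symbols):
    """Split the joined text into line-level strings."""
    return join_symbols(symbols).splitlines()
```

With the 2.x client, the pairs could be built inside the symbol loop as `(symbol.text, symbol.property.detected_break.type_)`. If you would rather rejoin hyphenated words than keep the image's line layout, map HYPHEN to an empty string instead.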