I am trying to convert the full-text annotations of a Google Vision OCR result, which come in a Block, Paragraph, Word and Symbol hierarchy, to line level and word level.
However, when converting symbols to word text and words to line text, I need to understand the DetectedBreak property.
I went through this documentation, but I did not understand a few of the break types.
Can somebody explain what the following breaks mean? I only understood LINE_BREAK and SPACE.
Can they be replaced by either a newline character or a space?
The link you provided has the most detailed explanation available of what each of these stands for. I suppose the best way to get a better understanding is to run OCR on different images and compare the response with what you see in the corresponding image. The following Python script runs DOCUMENT_TEXT_DETECTION on an image saved in GCS and prints every detected break except the ones you already understand (LINE_BREAK and SPACE), along with the word immediately preceding each break to enable comparison.
import os
import sys

from google.cloud import vision


def detect_breaks(gcs_image):
    # Point the client library at a service-account key file.
    os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/path/to/json'
    client = vision.ImageAnnotatorClient()

    # vision.types and vision.enums are the namespaces used by
    # google-cloud-vision releases before 2.0.
    feature = vision.types.Feature(
        type=vision.enums.Feature.Type.DOCUMENT_TEXT_DETECTION)
    image_source = vision.types.ImageSource(image_uri=gcs_image)
    image = vision.types.Image(source=image_source)
    request = vision.types.AnnotateImageRequest(
        features=[feature], image=image)

    annotation = client.annotate_image(request).full_text_annotation
    breaks = vision.enums.TextAnnotation.DetectedBreak.BreakType

    # Accumulate symbols until a break is seen, then reset the buffer.
    word_text = ""
    for page in annotation.pages:
        for block in page.blocks:
            for paragraph in block.paragraphs:
                for word in paragraph.words:
                    for symbol in word.symbols:
                        word_text += symbol.text
                        detected_break = symbol.property.detected_break
                        if detected_break.type:
                            # Skip the two break types that are already
                            # understood; print every other kind together
                            # with the text that precedes it.
                            if detected_break.type not in (
                                    breaks.SPACE, breaks.LINE_BREAK):
                                print(word_text, detected_break)
                            word_text = ""


if __name__ == '__main__':
    # Usage: python script.py gs://bucket/path/to/image
    detect_breaks(sys.argv[1])
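As for whether the breaks can be replaced by a newline or a space, the snippet below is a minimal sketch of one possible mapping, not an official rule. The break names match the documented BreakType enum, but the replacement strings in BREAK_TEXT and the helper symbols_to_lines are assumptions made for illustration, and the input is a hand-written list of (text, break name) pairs rather than a real API response.

```python
# Hypothetical mapping from DetectedBreak.BreakType names to replacement text.
# The names follow the documented enum; the replacement strings are assumptions.
BREAK_TEXT = {
    "SPACE": " ",
    "SURE_SPACE": " ",       # very wide space; a tab may suit tabular layouts
    "EOL_SURE_SPACE": "\n",  # line-wrapping break
    "HYPHEN": "-\n",         # end-of-line hyphen that is not in the symbol text
    "LINE_BREAK": "\n",      # line break that ends a paragraph
}


def symbols_to_lines(symbols):
    """Join (text, break_name) pairs into a list of line strings.

    break_name is None when the symbol carries no detected break;
    unrecognised names (for example UNKNOWN) add nothing.
    """
    text = "".join(sym + BREAK_TEXT.get(brk, "") for sym, brk in symbols)
    return text.splitlines()


# Hand-written sample standing in for symbols read from a response.
symbols = [
    ("H", None), ("i", "SPACE"),
    ("a", None), ("l", None), ("l", "EOL_SURE_SPACE"),
    ("b", None), ("y", None), ("e", "LINE_BREAK"),
]
print(symbols_to_lines(symbols))
```

Whether HYPHEN should keep the hyphen or rejoin the two halves of the word depends on what the line-level text is used for, so comparing the output against the source image is still worthwhile.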