Tags: google-cloud-platform, ocr, google-cloud-vision

Google Vision OCR: rotate word coordinates back to 0 degrees from 90, 180, and 270 degree documents


The Problem

Given the following guidance from the Google Vision OCR documentation (https://developers.google.com/resources/api-libraries/documentation/vision/v1p1beta1/python/latest/vision_v1p1beta1.files.html):

                  "boundingBox": { # A bounding polygon for the detected image annotation. # The bounding box for the paragraph.
                      # The vertices are in the order of top-left, top-right, bottom-right,
                      # bottom-left. When a rotation of the bounding box is detected the rotation
                      # is represented as around the top-left corner as defined when the text is
                      # read in the 'natural' orientation.
                      # For example:
                      #   * when the text is horizontal it might look like:
                      #      0----1
                      #      |    |
                      #      3----2
                      #   * when it's rotated 180 degrees around the top-left corner it becomes:
                      #      2----3
                      #      |    |
                      #      1----0
                      #   and the vertex order will still be (0, 1, 2, 3).

So as an experiment, I scanned the same document in four different orientations, namely 0, 90, 180, and 270 degrees, and ran each scan through Google's Vision OCR (DOCUMENT_TEXT_DETECTION). This gives the following results from Google's OCR output.

Document with 0 degrees orientation. This is the default, with horizontal text and 0 degrees of text rotation. Its four corners are:

0----1
|    |
3----2
Document height 3508
Document width 2479

Example of output text

LEGO - {'vertices': [{'x': 755, 'y': 172}, {'x': 877, 'y': 173}, {'x': 876, 'y': 237}, {'x': 754, 'y': 236}]}
LEGOLAND - {'vertices': [{'x': 1994, 'y': 189}, {'x': 2269, 'y': 192}, {'x': 2268, 'y': 244}, {'x': 1993, 'y': 241}]}

Document with 90 degrees orientation.

1----2
|    |
0----3
*vertex order will still be (0, 1, 2, 3)
Document height 2479
Document width 3508

Example of output text

LEGO - {'vertices': [{'x': 170, 'y': 1730}, {'x': 171, 'y': 1604}, {'x': 241, 'y': 1604}, {'x': 240, 'y': 1730}]}
LEGOLAND - {'vertices': [{'x': 188, 'y': 486}, {'x': 192, 'y': 213}, {'x': 245, 'y': 214}, {'x': 241, 'y': 487}]}

Document with 180 degrees orientation.

2----3
|    |
1----0
*vertex order will still be (0, 1, 2, 3)
Document height 3508
Document width 2479

Example of output text

LEGO - {'vertices': [{'x': 1740, 'y': 3337}, {'x': 1584, 'y': 3336}, {'x': 1585, 'y': 3259}, {'x': 1741, 'y': 3260}]}
LEGOLAND - {'vertices': [{'x': 485, 'y': 3315}, {'x': 212, 'y': 3311}, {'x': 213, 'y': 3261}, {'x': 486, 'y': 3265}]}

Document with 270 degrees orientation.

3----0
|    |
2----1
*vertex order will still be (0, 1, 2, 3)
Document height 2479
Document width 3508

Example of output text

LEGO - {'vertices': [{'x': 3335, 'y': 738}, {'x': 3333, 'y': 893}, {'x': 3269, 'y': 892}, {'x': 3271, 'y': 737}]}
LEGOLAND - {'vertices': [{'x': 3318, 'y': 1994}, {'x': 3313, 'y': 2266}, {'x': 3261, 'y': 2265}, {'x': 3266, 'y': 1993}]}

And now the question/problem

Given that we have a document scanned at 90, 180, or 270 degrees, how do we mathematically transform the coordinates so that, no matter which orientation the document was scanned in, they all give the same result as the default 0-degree document? In other words, how do we correct the coordinates of the 90-, 180-, and 270-degree scans as if the document had been scanned at 0 degrees?

This problem might seem simple to some, but I have been trying all kinds of methods over the past few days and I cannot seem to figure it out.

So the input parameters are: the scanned page orientation (0, 90, 180, or 270 degrees), the text vertices from the Google OCR output, and the page size (height and width), also from Google's OCR.

The output must be the text vertices corrected to a 0-degree page orientation.
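
To make the desired contract concrete, here is a hypothetical function stub (the name and parameter names are mine for illustration, not from Google's API):

    def correct_vertices(vertices, orientation, width, height):
        """Return the word's vertices as they would appear had the page been
        scanned at 0 degrees. `width` and `height` are the scanned page's
        dimensions as reported by the OCR output."""
        ...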


Solution

  • I'll give you the mathematical answer. Bear in mind that math is an exact science, whereas a Vision OCR scan is an empirical technique, i.e. not an exact science.

    Allow me to give a simple example so you can see the behavior. Imagine a document of height 10 and width 4 with a point at coordinates (1,9). When you rotate it 90º, the coordinates of the point become (9,3), then (3,1), and finally (1,1).

    The reason is that, for a generic rectangle of height H and width W, a 90º rotation maps a point (a,b) as:
    (a,b) -> (b, W-a), with W' = H and H' = W.

    Repeating this transformation yields the 180º and 270º transformations, giving the sequence
    (a,b) -> (b, W-a) -> (W-a, H-b) -> (H-b, a) -> (a,b)
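
    As a quick check of this cycle, here is a minimal Python sketch (my own illustration, not part of any OCR API) that applies the 90º step four times to the height-10, width-4 example above:

    def rotate90(point, width, height):
        # One 90º step: (a, b) -> (b, W - a); the page's width and height swap.
        a, b = point
        return (b, width - a), height, width

    p, w, h = (1, 9), 4, 10
    for _ in range(4):
        p, w, h = rotate90(p, w, h)
        print(p)  # (9, 3), then (3, 1), then (1, 1), and back to (1, 9)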

    So, turning any point in the sequence back to (a,b) is just a matter of solving a simple equation, given that you know all the parameters involved.

    For example, for your 180º degree bounding boxes:

    LEGO - {'vertices': [{'x': 1740, 'y': 3337}, {'x': 1584, 'y': 3336}, {'x': 1585, 'y': 3259}, {'x': 1741, 'y': 3260}]}
    
    • Each x value follows x = width - x0, so x0 = width - x
    • Each y value follows y = height - y0, so y0 = height - y

    Which gives:

    LEGO - {'vertices': [{'x': 739, 'y': 171}, {'x': 895, 'y': 172}, {'x': 894, 'y': 249}, {'x': 738, 'y': 248}]}
    

    Of course this is slightly different from your original values. If you perform these simple transformations for all the rotations, you will see that they all differ slightly. Remember, these are empirical "bounding boxes": they carry an associated error, and it is improbable that they come out identical, as they would in a purely "mathematical" problem.
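
    Putting it all together, here is a minimal Python sketch of the back-rotation (the function and variable names are my own; `width` and `height` are the scanned, i.e. rotated, page's dimensions as reported by the OCR):

    def correct_vertices(vertices, orientation, width, height):
        # Map vertices from a page scanned at `orientation` degrees back to
        # 0-degree coordinates, inverting the sequence
        # (a,b) -> (b, W-a) -> (W-a, H-b) -> (H-b, a) described above.
        def to_zero(x, y):
            if orientation == 0:
                return x, y
            if orientation == 90:
                return height - y, x
            if orientation == 180:
                return width - x, height - y
            if orientation == 270:
                return y, width - x
            raise ValueError("orientation must be 0, 90, 180 or 270")

        corrected = []
        for v in vertices:
            nx, ny = to_zero(v['x'], v['y'])
            corrected.append({'x': nx, 'y': ny})
        return corrected

    # Your 180º LEGO example (scanned page: width 2479, height 3508):
    lego_180 = [{'x': 1740, 'y': 3337}, {'x': 1584, 'y': 3336},
                {'x': 1585, 'y': 3259}, {'x': 1741, 'y': 3260}]
    print(correct_vertices(lego_180, 180, 2479, 3508))
    # [{'x': 739, 'y': 171}, {'x': 895, 'y': 172},
    #  {'x': 894, 'y': 249}, {'x': 738, 'y': 248}]

    Running the same function over your 90º and 270º examples recovers the 0º coordinates up to the same small scan error.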