Tags: python, path, python-imaging-library, google-cloud-storage

Using PIL module to open file from GCS


I am a beginner in programming, and this is my first small project. I have hit a roadblock and would appreciate some help. Any advice is welcome. Thank you in advance!

Here is what I want to do:

I want to build a text detection application and extract the text for further use (for instance, to map it to other relevant information in a dataset). I divided the work into two steps: 1. detect the text; 2. extract the text and use regular expressions to rearrange it for the data mapping.

For the first step I use the Google Vision API, so I have no problem reading the image from Google Cloud Storage (code reference 1).

However, for step two I need the PIL module to open the file so that I can draw the text on it. The method Image.open() requires a path. My question is: how do I specify the path to a file in GCS? (code reference 2)

code reference 1:

from google.cloud import vision

image_uri = 'gs://img_platecapture/img_001.jpg'
client = vision.ImageAnnotatorClient()
image = vision.Image()
image.source.image_uri = image_uri  ##  <- THE PATH  ##

response = client.text_detection(image=image)
for text in response.text_annotations:
    print('=' * 30)
    print(text.description)
    vertices = ['(%s,%s)' % (v.x, v.y) for v in text.bounding_poly.vertices]
    print('bounds:', ",".join(vertices))

if response.error.message:
    raise Exception(
        '{}\nFor more info on error messages, check: '
        'https://cloud.google.com/apis/design/errors'.format(
            response.error.message))

code reference 2:

from PIL import Image, ImageDraw
from PIL import ImageFont
import re

img = Image.open(?)  ##  <- THE PATH  ##
draw = ImageDraw.Draw(img)
font = ImageFont.truetype("simsun.ttc", 18)

# Skip the first annotation, which holds the full detected text.
for text in response.text_annotations[1:]:
    ocr = text.description
    bound = text.bounding_poly
    # Label each detected word slightly above and to the left of its box.
    draw.text((bound.vertices[0].x - 25, bound.vertices[0].y - 25), ocr, fill=(255, 0, 0), font=font)

    # Outline the bounding box in yellow (no fill).
    draw.polygon(
        [
            bound.vertices[0].x,
            bound.vertices[0].y,
            bound.vertices[1].x,
            bound.vertices[1].y,
            bound.vertices[2].x,
            bound.vertices[2].y,
            bound.vertices[3].x,
            bound.vertices[3].y,
        ],
        None,
        'yellow',
    )

texts = response.text_annotations

a = str(texts[0].description.split())
# Keep only CJK characters and ASCII digits.
b = re.sub(u"([^\u4e00-\u9fa5\u0030-\u0039])", "", a)
b1 = "".join(b)

regex1 = re.search(r"\D{1,2}Dist.", b)
if regex1:
    regex1 = "{}".format(regex1.group(0))

     .........

Solution

  • PIL has no built-in ability to open files directly from GCS. You will need to either

    1. Download the file to local storage and point PIL to that file or

    2. Give PIL a BlobReader which it can use to access the data:

      from PIL import Image
      from google.cloud import storage
      
      storage_client = storage.Client()
      bucket = storage_client.bucket('img_platecapture')
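      # Use get_blob() to pin the generation number, so the data is not
      # corrupted if the blob is overwritten while we read it.
      blob = bucket.get_blob('img_001.jpg')
      # Open in binary mode; blob.open() defaults to text mode, which PIL cannot read.
      with blob.open('rb') as file:
          img = Image.open(file)
          # Use img inside this block: PIL reads the pixel data lazily.
          # ...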
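
Since the question already has the image as a gs:// URI, it may help to see how that URI maps onto the bucket and blob names used above. The following is only a minimal sketch of a variant of option 1 that downloads the object into memory rather than to disk. The helper names split_gcs_uri and open_gcs_image are hypothetical (they are not part of PIL or google-cloud-storage), and the sketch assumes default credentials are configured and that the image is small enough to hold in memory.

```python
import io


def split_gcs_uri(uri):
    """Split 'gs://bucket/path/to/object' into (bucket_name, blob_name).

    Hypothetical helper for illustration; not part of any library.
    """
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a gs:// URI: {uri!r}")
    # Everything up to the first '/' is the bucket, the rest is the object name.
    bucket_name, _, blob_name = uri[len(prefix):].partition("/")
    if not bucket_name or not blob_name:
        raise ValueError(f"URI must contain a bucket and an object name: {uri!r}")
    return bucket_name, blob_name


def open_gcs_image(uri):
    """Download the object into memory and return a fully loaded PIL image.

    Hypothetical helper; assumes google-cloud-storage and Pillow are installed.
    """
    # Imported lazily so split_gcs_uri works without these packages.
    from google.cloud import storage
    from PIL import Image

    bucket_name, blob_name = split_gcs_uri(uri)
    blob = storage.Client().bucket(bucket_name).blob(blob_name)
    data = blob.download_as_bytes()  # whole object as bytes
    img = Image.open(io.BytesIO(data))
    img.load()  # force decoding now, so the image does not depend on the buffer later
    return img
```

With something like this, the question's code reference 2 could call img = open_gcs_image(image_uri) and then pass img to ImageDraw.Draw as before. Downloading everything up front trades memory for simplicity; the BlobReader approach above may be preferable for large files.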