Tags: character-encoding, google-cloud-dataflow, apache-beam

Apache Beam / GCP Dataflow encoding issue


i am "playing" with apache beam/dataflow in datalab. I am trying to read a csv file from gcs. when i create the pcollection using:

lines = p | 'ReadMyFile' >> beam.io.ReadFromText('gs://' + BUCKET_NAME + '/' + input_file, coder='StrUtf8Coder')

I get the following error:

LookupError: unknown encoding: "THE","NAME","OF","COLUMNS"

It seems the column names are being interpreted as the encoding?

I do not understand what's wrong. If I do not specify the coder, I get:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xe0 in position 1045: invalid continuation byte

Outside Apache Beam I am able to handle this error by reading the file from GCS:

from google.cloud import storage

blob = storage.Blob(gs_path, bucket)
data = blob.download_as_string()
text = data.decode('utf-8', 'ignore')  # skip the bytes that are not valid UTF-8

I read that Apache Beam only supports UTF-8, and the file does not contain only UTF-8.

Should I download the file and then convert it to a PCollection? Something like the sketch below:
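(A minimal sketch of what I mean, reusing BUCKET_NAME and input_file from above; splitting on lines is just my assumption about how to shape the elements.)

import apache_beam as beam
from google.cloud import storage

# Download and decode leniently outside Beam, then build the PCollection.
client = storage.Client()
blob = client.bucket(BUCKET_NAME).blob(input_file)
text = blob.download_as_string().decode('utf-8', 'ignore')

p = beam.Pipeline()
lines = p | 'CreateMyFile' >> beam.Create(text.splitlines())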

Any suggestions?


Solution

  • I would suggest changing the encoding of the actual file. If you open it in Excel and use "Save As", you can select UTF-8 as the encoding for CSV and regular .txt files. Once you do that, you need to make sure you add a line of code like:

    class DoWork(beam.DoFn):
        def process(self, text):
            # Each element is one line of the file; re-encode it as UTF-8.
            text = text.encode('utf-8')

            # ... do other stuff, then emit the element ...
            yield text


    This isn't how I would like to do it because it isn't code-centric, but it has worked for me before. Unfortunately, I don't have a code-centric solution.
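
    For reference, a DoFn like this would be applied with beam.ParDo; the read step below just mirrors the one from the question, and the step names are placeholders:

    import apache_beam as beam

    p = beam.Pipeline()
    lines = p | 'ReadMyFile' >> beam.io.ReadFromText('gs://' + BUCKET_NAME + '/' + input_file)
    encoded = lines | 'EncodeUtf8' >> beam.ParDo(DoWork())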