Tags: python, google-cloud-platform, google-cloud-dataflow, apache-beam

How does one access an apache_beam.io.fileio.ReadableFile() object?


I am trying to use the apache_beam.io.fileio module to read a file, lines.txt, and use its contents in my pipeline.

lines.txt has the following contents:

line1
line2
line3

When I run the following pipeline code:

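import apache_beam as beam
import apache_beam.io.fileio  # make sure the fileio submodule is loaded
from apache_beam.options.pipeline_options import PipelineOptions

# Assumption: default options; the post does not show how pipeline_options was built
pipeline_options = PipelineOptions()

with beam.Pipeline(options=pipeline_options) as p:

    lines = (
        p
        | beam.io.fileio.MatchFiles(file_pattern="lines.txt")
        | beam.io.fileio.ReadMatches()
    )
    # print file contents to screen
    lines | 'print to screen' >> beam.Map(print)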

I get the following output:

<apache_beam.io.fileio.ReadableFile object at 0x000001A8C6C55F08>

I expected

line1
line2
line3

How can I yield my expected result?


Solution

  • The resulting PCollection from

    p
    | beam.io.fileio.MatchFiles(file_pattern="lines.txt")
    | beam.io.fileio.ReadMatches()
    

    contains ReadableFile objects, one per matched file, not the file contents. To get at the contents, call one of the methods documented in the Apache Beam pydoc: open(), read(), or read_utf8().

    Below we call read_utf8(), which returns the whole file as a single UTF-8 string:

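    import apache_beam as beam
    import apache_beam.io.fileio  # make sure the fileio submodule is loaded
    from apache_beam.options.pipeline_options import PipelineOptions

    # Assumption: default options; the post does not show how pipeline_options was built
    pipeline_options = PipelineOptions()

    with beam.Pipeline(options=pipeline_options) as p:

        lines = (
            p
            | beam.io.fileio.MatchFiles(file_pattern="lines.txt")
            | beam.io.fileio.ReadMatches()
            # read each matched file's full contents as one UTF-8 string
            | beam.Map(lambda file: file.read_utf8())
        )
        # print file contents to screen
        lines | 'print to screen' >> beam.Map(print)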
    

    and we get our expected result:

    line1
    line2
    line3
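
Note that read_utf8() hands back the entire file as one string, so the pipeline above produces a single element that happens to print as three lines. If later steps need one element per line, a possible variation is to split the text and use beam.FlatMap. The sketch below is only an illustration: split_lines and run are hypothetical helper names, not part of Beam, and the pipeline is assumed to run with default options in a directory containing lines.txt.

```python
def split_lines(readable_file):
    """Return the lines of one matched file as a list of strings.

    Hypothetical helper: `readable_file` only needs a read_utf8() method,
    which ReadableFile provides.
    """
    return readable_file.read_utf8().splitlines()


def run(file_pattern="lines.txt"):
    # Beam is imported lazily so split_lines stays usable without Beam installed.
    import apache_beam as beam
    from apache_beam.io import fileio

    # Assumption: default pipeline options (local runner).
    with beam.Pipeline() as p:
        (
            p
            | fileio.MatchFiles(file_pattern)
            | fileio.ReadMatches()
            # FlatMap flattens the returned list: one output element per line.
            | beam.FlatMap(split_lines)
            | "print to screen" >> beam.Map(print)
        )
```

Calling run() should then emit each line as its own element. If the file metadata carried by ReadableFile is not needed at all, beam.io.ReadFromText("lines.txt") is a more direct way to get one element per line.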