apache-beam, apache-beam-io, apache-beam-internals

I see Apache Beam scales easily with the # of CSV files, but what about the # of lines in one CSV?


I am currently reading this article and the Apache Beam docs: https://medium.com/@mohamed.t.esmat/apache-beam-bites-10b8ded90d4c

Everything I have read is about N files. In our use case, we receive a Pub/Sub event for ONE new file each time to kick off a job. I don't need to scale per file, as I could use Cloud Run for that. I need to scale with the number of lines in a file, i.e. I would like a 100-line file and a 100,000,000-line file to be processed in roughly the same time.

If I follow the above article and give it ONE file instead of many, how will Apache Beam scale behind the scenes? How will it know to use 1 node for a 100-line file vs. perhaps 1,000 nodes for the 1,000,000-line file? After all, it doesn't know how many lines are in the file to begin with.

Does Dataflow NOT scale with the # of lines in the file? I was thinking perhaps node 1 would read rows 0-99 and node 2 would read/discard rows 0-99 and then read rows 100-199.
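To make that mental model concrete, here is a rough standalone Java sketch of the "seek to a byte offset and skip the partial line" idea. The class name, method, and offsets are made up purely for illustration; this is not how Beam actually implements its readers.

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a worker reads only the lines whose starting byte
// falls inside its assigned range [start, end), instead of re-reading the
// file from the top and discarding rows it doesn't own.
public class RangeReader {

  static List<String> readRange(String path, long start, long end) throws IOException {
    List<String> lines = new ArrayList<>();
    try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
      file.seek(start);
      if (start > 0) {
        // Skip the partial line we landed in; the worker that owns the
        // previous range will emit that line.
        file.readLine();
      }
      String line;
      // Keep reading as long as the line *starts* inside [start, end).
      // RandomAccessFile.readLine() is byte-oriented, which is fine for an
      // ASCII CSV sketch like this one.
      while (file.getFilePointer() < end && (line = file.readLine()) != null) {
        lines.add(line);
      }
    }
    return lines;
  }
}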

Does anyone know what is happening under the hood so I don't end up wasting hours of test time trying to figure out if it scales with respect to # of lines in a file?

EDIT: Related question, but not the same question - How to read large CSV with Beam?

I think Dataflow may be bottlenecked by one node reading in the whole file, which I could do on just a regular computer, BUT I am really wondering if it could be better than this.

Another way to say this is: behind the scenes, what is this line actually doing?

PCollection<String> leftInput = p.apply(TextIO.read().from("left.csv"));

It could be 1 node reading and then sending to a bunch of other nodes, but there is a clear bottleneck if there is only 1 reader of the CSV when the CSV is big-data sized.
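For reference, here is roughly how that read sits inside a full pipeline. TextIO.read().from(...) is just a transform until it is applied to a pipeline; the gs:// path, options, and the downstream Count transform below are placeholders I added for illustration, not anything from our actual job.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.values.PCollection;

public class ReadCsvPipeline {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // TextIO.read() is a PTransform; it only produces a PCollection once it
    // is applied to a pipeline. How the read is actually executed (one reader
    // or many) is up to the runner.
    PCollection<String> leftInput =
        p.apply("ReadLeft", TextIO.read().from("gs://my-bucket/left.csv"));

    // Downstream transforms just see lines, however they were read.
    leftInput.apply("CountLines", Count.globally());

    p.run().waitUntilFinish();
  }
}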

More context on my thinking: I do see a "HadoopFileSystem" connector (though we talk to Google Cloud Storage). My guess is that the HadoopFileSystem one relies on the fact that HDFS has 'partition files' representing the file, so it is already N files. We use Google Cloud Storage, so it is simply one CSV file, not N files. While the HDFS connector could spin up the same number of nodes as partitions, TextIO only sees one CSV file and that is it.


Solution

  • Thankfully my colleague found this post, which reads only one line (the header):

    http://moi.vonos.net/cloud/beam-read-header/

    However, I think it does show how to write code that partitions the file so that different workers read different parts of it. I think this will solve it!!!

    If someone has a good example of CSV partitioning, that would rock, but we can try to create our own; currently, one worker reads in the whole file. A rough sketch of what we have in mind is below.
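We haven't built it yet, but here is a rough sketch of the kind of byte-range partitioning we have in mind, built on FileIO.match() plus two plain DoFns. The bucket path, the 64 MB shard size, and the assumptions of '\n' line endings, UTF-8 text, and a seekable channel are all illustrative choices on our part, not something taken from the linked post.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.SeekableByteChannel;
import java.nio.charset.StandardCharsets;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.fs.MatchResult;
import org.apache.beam.sdk.io.fs.ResourceId;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Reshuffle;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class PartitionedCsvRead {

  // Roughly 64 MB per shard; purely an illustrative choice.
  private static final long SHARD_BYTES = 64L * 1024 * 1024;

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    PCollection<String> lines =
        p.apply("MatchFile", FileIO.match().filepattern("gs://my-bucket/input.csv"))
         // Emit one element per (file, byte range) pair.
         .apply("SplitIntoRanges", ParDo.of(
             new DoFn<MatchResult.Metadata, KV<String, KV<Long, Long>>>() {
               @ProcessElement
               public void process(@Element MatchResult.Metadata file,
                                   OutputReceiver<KV<String, KV<Long, Long>>> out) {
                 long size = file.sizeBytes();
                 for (long start = 0; start < size; start += SHARD_BYTES) {
                   long end = Math.min(start + SHARD_BYTES, size);
                   out.output(KV.of(file.resourceId().toString(), KV.of(start, end)));
                 }
               }
             }))
         // Redistribute the ranges so different workers pick up different shards.
         .apply("Redistribute", Reshuffle.viaRandomKey())
         // Each worker reads only the lines that start inside its range.
         .apply("ReadRange", ParDo.of(
             new DoFn<KV<String, KV<Long, Long>>, String>() {
               @ProcessElement
               public void process(@Element KV<String, KV<Long, Long>> shard,
                                   OutputReceiver<String> out) throws Exception {
                 ResourceId resource =
                     FileSystems.matchSingleFileSpec(shard.getKey()).resourceId();
                 long start = shard.getValue().getKey();
                 long end = shard.getValue().getValue();
                 try (ReadableByteChannel channel = FileSystems.open(resource)) {
                   if (channel instanceof SeekableByteChannel) {
                     ((SeekableByteChannel) channel).position(start);
                   }
                   BufferedReader reader = new BufferedReader(new InputStreamReader(
                       Channels.newInputStream(channel), StandardCharsets.UTF_8));
                   long pos = start;
                   if (start > 0) {
                     // Skip the partial line; it belongs to the previous shard.
                     String skipped = reader.readLine();
                     pos += (skipped == null ? 0 : skipped.getBytes(StandardCharsets.UTF_8).length + 1);
                   }
                   String line;
                   // Assumes '\n' line endings when tracking the byte position.
                   while (pos < end && (line = reader.readLine()) != null) {
                     out.output(line);
                     pos += line.getBytes(StandardCharsets.UTF_8).length + 1;
                   }
                 }
               }
             }));

    // ... downstream CSV parsing would hang off `lines` here.

    p.run().waitUntilFinish();
  }
}

The Reshuffle step is there so the generated ranges get redistributed across workers instead of staying fused onto the single worker that matched the file.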