python, concatenation, apache-beam

Concat keys in PCollection


I'm trying to concatenate the values of two keys in Apache Beam to get a new list made up of all the items from both keys.

Suppose I have a PCollection as follows:

(
  "Key1": [file1, file2],
  "Key2": [file3, file4],
)

How do I get a PCollection that looks like this using the Python Apache Beam SDK?

(
   "Key3": [file1, file2, file3, file4]
)

Solution

  • I solved this using the following code:

    new_pcol = (
        (
            pcol1,
            pcol2,
        )
        | "Flatten" >> beam.Flatten()
        | "Format flat" >> beam.MapTuple(lambda k, files: ("Key3", files))
        | "Group by new key" >> beam.GroupByKey()
    )
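
One thing to watch for: if each element's value is already a list, as in the question, then re-keying with MapTuple and grouping gives "Key3" an iterable of lists (one list per original key) rather than a single flat list. Below is a minimal sketch of one way around that, assuming the inputs are (key, list_of_files) pairs: emit one ("Key3", file) pair per file with beam.FlatMapTuple and then group. The names rekey_each_file, NEW_KEY and run, as well as the sample data, are made up for illustration and are not from the original post.

```python
NEW_KEY = "Key3"  # hypothetical target key, taken from the question's example


def rekey_each_file(key, files):
    """Turn (key, [f1, f2, ...]) into [(NEW_KEY, f1), (NEW_KEY, f2), ...]."""
    return [(NEW_KEY, f) for f in files]


def run():
    # Imported here so the helper above can be exercised without Beam installed.
    import apache_beam as beam

    with beam.Pipeline() as p:
        # Sample inputs shaped like the PCollection in the question (assumed).
        pcol1 = p | "Create 1" >> beam.Create([("Key1", ["file1", "file2"])])
        pcol2 = p | "Create 2" >> beam.Create([("Key2", ["file3", "file4"])])

        new_pcol = (
            (pcol1, pcol2)
            | "Flatten" >> beam.Flatten()
            # One output pair per file, so grouping yields a flat collection.
            | "One pair per file" >> beam.FlatMapTuple(rekey_each_file)
            | "Group by new key" >> beam.GroupByKey()
            # GroupByKey gives an iterable in no guaranteed order; sort into a list.
            | "To sorted list" >> beam.MapTuple(lambda k, files: (k, sorted(files)))
        )
        new_pcol | "Print" >> beam.Map(print)


if __name__ == "__main__":
    run()
```

The final sort is optional. It is only there because GroupByKey does not guarantee the order of the grouped values, and it also turns the grouped iterable into a plain list.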