
How to combine or merge PCollections (multiple ParDo yields) in Apache Beam Python


I have a custom ParDo function which gets data from an API and yields a pandas DataFrame after every hit.

I perform some data manipulation, and in the end I want to combine all of those DataFrames (or PCollection elements) into one before writing them to disk as a CSV file.

Below is a basic representation of how my code works:

import apache_beam as beam
import pandas as pd
import requests

class GetData(beam.DoFn):
    def __init__(self, hits):
        self.no_of_hits = hits

    def process(self, url):
        # One request per hit; each response becomes one DataFrame element.
        for i in range(self.no_of_hits):
            response = requests.get(url + str(i))  # assuming the hit index is appended to the url
            df = pd.json_normalize(response.json())
            yield df

with beam.Pipeline() as pipeline:
    data = (pipeline
        | "url to start the pipeline" >> beam.Create([url])
        | "get data from api" >> beam.ParDo(GetData(hits)))
    wrangled = (... some basic manipulation to each dataframe)
    combine = ???

But I am new to Apache Beam, so I don't understand how to do this.

I have tried to use beam.Flatten(), but it takes an iterable of PCollections as input, and I only have a single PCollection.
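
As far as I understand it, Flatten merges several separate PCollections into one PCollection; it does not collapse the elements of a single PCollection into one object, which is what I need. A minimal sketch of my understanding:

import apache_beam as beam

with beam.Pipeline() as pipeline:
    a = pipeline | "a" >> beam.Create([1, 2])
    b = pipeline | "b" >> beam.Create([3, 4])
    # Flatten merges the two PCollections; the result still has four separate elements.
    merged = (a, b) | beam.Flatten()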

The PCollection is not schema'd, and it is not a deferred Beam DataFrame.
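
I imagine the missing combine step might look something like the sketch below, using beam.CombineGlobally with pd.concat, but I have not verified that this is the right approach:

import pandas as pd
import apache_beam as beam

# Collapse every DataFrame in the PCollection into one. pd.concat may be applied
# repeatedly to partial results, which is fine for concatenation.
combined = wrangled | beam.CombineGlobally(
    lambda dfs: pd.concat(list(dfs), ignore_index=True))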

Thank you; any help is appreciated.


Solution

  • Update: It was not a good idea to go with pandas for this; as @XQHu pointed out, it is not scalable, since each pandas DataFrame has to fit in the memory of a single worker.

    I am using plain Python for now and might switch to the Beam DataFrame API if possible; a sketch of the plain-Python route follows this list.

    For now there is no step for creating a Beam DataFrame that is as straightforward as it is in pandas (loading data from files does create a Beam DataFrame), so some workaround is needed; see the second sketch below.

    Beam DataFrames can make use of distributed processing.
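
    A minimal sketch of the plain-Python route, assuming GetData is reworked to yield plain dicts instead of DataFrames (url and hits are the same placeholders as in the question):

    import apache_beam as beam

    with beam.Pipeline() as pipeline:
        (pipeline
            | "start" >> beam.Create([url])
            | "get data" >> beam.ParDo(GetData(hits))  # now yields dicts, not DataFrames
            # Render each dict as one CSV line; sorting the keys fixes the column order.
            | "to csv line" >> beam.Map(lambda row: ",".join(str(row[k]) for k in sorted(row)))
            # WriteToText writes sharded output files in parallel.
            | "write" >> beam.io.WriteToText("output", file_name_suffix=".csv"))

    Because the sink writes shards in parallel, nothing ever has to be collapsed into one in-memory object.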
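
    And one possible workaround for getting a deferred Beam DataFrame out of an ordinary PCollection: attach a schema with beam.Row, then convert it with apache_beam.dataframe.convert.to_dataframe (the field names here are made up for illustration):

    import apache_beam as beam
    from apache_beam.dataframe.convert import to_dataframe

    with beam.Pipeline() as pipeline:
        rows = (pipeline
            | beam.Create([{"id": 1, "value": "a"}, {"id": 2, "value": "b"}])
            # beam.Row attaches a schema, which to_dataframe requires.
            | beam.Map(lambda d: beam.Row(id=int(d["id"]), value=str(d["value"]))))
        df = to_dataframe(rows)  # deferred, distributed DataFrame
        df.to_csv("output.csv")  # runs lazily when the pipeline executes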