I have a custom ParDo function that gets data from an API and yields a pandas dataframe after every hit.
I perform some data manipulation, and in the end I want to combine all those dataframes (the elements of the PCollection) into one before I write them to disk as a CSV file.
Below is a basic representation of how my code works:
import apache_beam as beam
import pandas as pd
import requests

class GetData(beam.DoFn):
    def __init__(self, hits):
        self.no_of_hits = hits

    def process(self, url):
        # one API hit per iteration; each response becomes a dataframe
        for i in range(self.no_of_hits):
            response = requests.get(url + str(i))
            df = pd.json_normalize(response.json())
            yield df

with beam.Pipeline() as pipeline:
    # url and hits are defined elsewhere
    data = (pipeline
            | "url to start the pipeline" >> beam.Create([url])
            | "get data from api" >> beam.ParDo(GetData(hits)))
    wrangled = (... some basic manipulation to each dataframe)
    combine = ???
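The closest idea I have for the ??? step is to collect every dataframe into a single element with beam.CombineGlobally plus pd.concat, roughly like the sketch below; the output path is a placeholder and I have not verified that this is idiomatic:

combine = (wrangled
           | "concat dataframes" >> beam.CombineGlobally(
               lambda dfs: pd.concat(dfs, ignore_index=True)))
# write the single combined dataframe; "output.csv" is a placeholder path
combine | "write csv" >> beam.Map(lambda df: df.to_csv("output.csv", index=False))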
But I am new to Apache Beam, so I don't understand how I can do it.
I have tried beam.Flatten(), but it takes an iterable of PCollections as input.
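As far as I can tell, Flatten merges several separate PCollections into one, as in the minimal example below, so it does not seem to apply to a single PCollection whose elements are dataframes:

import apache_beam as beam

with beam.Pipeline() as p:
    a = p | "a" >> beam.Create([1, 2])
    b = p | "b" >> beam.Create([3, 4])
    # Flatten merges the PCollections a and b into one PCollection
    # containing 1, 2, 3, 4; it does not merge elements within one.
    merged = (a, b) | beam.Flatten()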
The PCollection is not schema'd and is not a deferred Beam dataframe.
Thank you, any help is appreciated.
Update: It was not a good idea to use pandas for this; as @XQHu pointed out, it is not scalable.
I am using plain Python for now and might use Beam dataframes if possible.
For now there is no straightforward way to create a Beam dataframe as easily as you can in pandas (loading data from files does create one), so a workaround might be needed.
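From what I have found, reading files yields a deferred dataframe directly, and the workaround for an in-memory PCollection seems to be attaching a schema with beam.Row and converting with to_dataframe; the path and field names below are placeholders:

import apache_beam as beam
from apache_beam.dataframe.io import read_csv
from apache_beam.dataframe.convert import to_dataframe

with beam.Pipeline() as p:
    # loading from files creates a deferred Beam dataframe directly
    df_from_file = p | read_csv("input*.csv")  # placeholder path

    # workaround for an in-memory PCollection: attach a schema with
    # beam.Row so that to_dataframe can convert it
    rows = (p
            | beam.Create(["a", "bb"])
            | beam.Map(lambda w: beam.Row(word=w, length=len(w))))
    df_from_pcoll = to_dataframe(rows)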
Beam dataframes can utilize distributed processing.
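For example, operations on a deferred dataframe are translated into Beam transforms, so something like the rough sketch below should run distributed across workers (column and path names are placeholders):

import apache_beam as beam
from apache_beam.dataframe.io import read_csv

with beam.Pipeline() as p:
    df = p | read_csv("input*.csv")  # placeholder path
    # groupby/sum here are deferred; Beam turns them into distributed
    # transforms instead of running them on a single machine
    totals = df.groupby("word").sum()
    totals.to_csv("totals.csv")  # placeholder output path; written in shards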