From this question: How to group data and construct a new column - python pandas?, I know how to groupby multiple columns and construct a new unique id by using pandas
, but if I want to use Apache beam
in Python to achieve the same thing that is described in that question, how can I achieve it and then write the new data to a newline delimited JSON format file (each line is one unique_id
with an array of objects that belong to that unique_id)?
Assuming the dataset is stored in a csv file.
I'm new to Apache beam, here's what I have now:
import pandas
import apache_beam as beam
from apache_beam.dataframe.io import read_csv
with beam.Pipeline() as p:
df = p | read_csv("example.csv", names=cols)
agg_df = df.insert(0, 'unique_id',
df.groupby(['postcode', 'house_number'], sort=False).ngroup())
agg_df.to_csv('test_output')
This gave me an error:
NotImplementedError: 'ngroup' is not yet supported (BEAM-9547)
This is really annoying, I'm not very familiar with Apache beam, can someone help please...
(ref: https://beam.apache.org/documentation/dsls/dataframes/overview/)
Assigning consecutive integers to a set is not something that's very amenable to parallel computation. It's also not very stable. Is there any reason another identifier (e.g. the tuple (postcode, house_number)
or its hash would not be suitable?