python json csv apache-beam apache-beam-io

Use Apache Beam `GroupByKey` and construct a new column - Python


From this question: How to group data and construct a new column - python pandas?, I know how to group by multiple columns and construct a new unique id using pandas. How can I achieve the same thing with Apache Beam in Python, and then write the new data to a newline-delimited JSON file (each line being one unique_id with an array of the objects that belong to that unique_id)?

Assume the dataset is stored in a CSV file.

I'm new to Apache Beam; here's what I have so far:

import pandas
import apache_beam as beam
from apache_beam.dataframe.io import read_csv

with beam.Pipeline() as p:
    df = p | read_csv("example.csv", names=cols)
    agg_df = df.insert(0, 'unique_id',
          df.groupby(['postcode', 'house_number'], sort=False).ngroup())
    agg_df.to_csv('test_output')        

This gave me an error:

NotImplementedError: 'ngroup' is not yet supported (BEAM-9547)

This is frustrating. I'm not very familiar with Apache Beam; can someone help, please?

(ref: https://beam.apache.org/documentation/dsls/dataframes/overview/)


Solution

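  • Assigning consecutive integers to a set of groups is not very amenable to parallel computation, because every worker would need to agree on a single global numbering. It's also not very stable: the ids can change whenever the input or its ordering changes. Is there any reason another identifier (e.g. the tuple (postcode, house_number) or its hash) would not be suitable?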
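
As a rough illustration of that suggestion, here is a minimal sketch that keys each row by the tuple (postcode, house_number), groups with `GroupByKey`, and writes one JSON object per line. Several things here are assumptions rather than anything stated in the question: the column list `COLS` (only `postcode` and `house_number` are known, the other names are hypothetical), the absence of a header row, the helper names, and the choice of a truncated SHA-1 digest as the id. Python's built-in `hash()` is avoided because string hashes are randomized per process, so different workers would disagree.

```python
import csv
import hashlib
import json

# Hypothetical column names: the question's `cols` list is not shown.
# Only "postcode" and "house_number" are known from the question.
COLS = ["postcode", "house_number", "street", "price"]


def parse_line(line):
    """Parse one CSV line (no header assumed) into a dict keyed by COLS."""
    values = next(csv.reader([line]))
    return dict(zip(COLS, values))


def key_by_address(row):
    """Return ((postcode, house_number), row) so GroupByKey can group on it."""
    return (row["postcode"], row["house_number"]), row


def stable_id(key):
    """Derive a deterministic id from the key tuple (truncated SHA-1 digest)."""
    joined = "|".join(key)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()[:12]


def to_json_line(element):
    """Turn one grouped (key, rows) element into a single JSON line."""
    key, rows = element
    record = {"unique_id": stable_id(key), "records": list(rows)}
    return json.dumps(record, sort_keys=True)


def run(input_path, output_prefix):
    """Build and run the pipeline on the default (local) runner."""
    # Imported here so the helpers above can be tried without Beam installed.
    import apache_beam as beam

    with beam.Pipeline() as p:
        (
            p
            | "Read" >> beam.io.ReadFromText(input_path)
            | "Parse" >> beam.Map(parse_line)
            | "KeyByAddress" >> beam.Map(key_by_address)
            | "Group" >> beam.GroupByKey()
            | "ToJson" >> beam.Map(to_json_line)
            | "Write" >> beam.io.WriteToText(output_prefix, file_name_suffix=".json")
        )
```

Calling `run("example.csv", "test_output")` would write one or more sharded files whose names start with `test_output`. If the tuple itself is an acceptable identifier, `stable_id` can be dropped and the key written out directly; the truncated digest is only one possible choice, and a longer digest lowers the chance of collisions.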