Tags: mongodb, amazon-s3, parquet, pyarrow, apache-arrow

PyArrow: How to batch data from mongo into partitioned parquet in S3


I want to be able to archive my data from Mongo into S3. Currently, what I do is

  • Read data from Mongo
  • Convert this into a pyarrow Table
  • Write to S3

It works for now, but steps 1 and 2 are a bulk operation, so if the result set is huge it all has to fit into memory. I could read smaller chunks, but then I'd end up with smaller Parquet files too, which is not performant. Is there a best practice for this?

import json

import pyarrow as pa
import pyarrow.parquet as pq
from motor.motor_asyncio import AsyncIOMotorCollection
from pyarrow import fs

audits_collection: AsyncIOMotorCollection = db.get_collection("audits")
results = await audits_collection.find({
    "eventOccurredAt": {
        "$gte": from_date,
        "$lt": to_date,
    },
}).to_list(None) # OK for now but thinking this is not scalable if it gets too large

transformed = json.loads(json.dumps(results, default=str)) # Stringify values like ObjectIds and datetimes
table = pa.Table.from_pylist(transformed, schema=schema)

s3 = fs.S3FileSystem()
pq.write_to_dataset(
    table, 
    root_path="bucket-name/archive", 
    partition_cols=["date", "hour"],
    existing_data_behavior="delete_matching",
    filesystem=s3,
)

Solution

  • Your approach is fine overall; yes, you will need to batch the export to keep memory under control, and it's reasonable to self-manage it with scripts similar to yours (see the sketch after this answer).

    At some point, when your scale grows, I'd recommend using a managed service. For example, AWS offers DMS, their "Database Migration Service", which can connect to a MongoDB instance and export the data into S3. It is somewhat more limited when it comes to restructuring the data, but you can add additional scripts that do that over S3 afterwards.

    Overall this is the "better practice", since the service already handles many of the issues you might otherwise run into; the trade-offs are that you will likely pay a bit more and it couples you to their cloud even more.
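
Here is a minimal sketch of the self-managed batching route, assuming Motor and a recent PyArrow (the function names, the batch size, and the unique basename_template per flush are illustrative choices, not part of the original question):

import json
import uuid

import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

BATCH_SIZE = 50_000  # illustrative; tune to your memory budget

async def archive_audits(db, schema, from_date, to_date):
    s3 = fs.S3FileSystem()
    cursor = db.get_collection("audits").find({
        "eventOccurredAt": {"$gte": from_date, "$lt": to_date},
    })

    batch = []
    async for doc in cursor:  # Motor streams documents lazily instead of loading them all
        batch.append(doc)
        if len(batch) >= BATCH_SIZE:
            flush_batch(batch, schema, s3)
            batch = []
    if batch:  # write whatever is left over
        flush_batch(batch, schema, s3)

def flush_batch(docs, schema, s3):
    # Same ObjectId/datetime stringification as the original snippet
    transformed = json.loads(json.dumps(docs, default=str))
    table = pa.Table.from_pylist(transformed, schema=schema)
    pq.write_to_dataset(
        table,
        root_path="bucket-name/archive",
        partition_cols=["date", "hour"],
        # "delete_matching" would let a later batch wipe files an earlier
        # batch wrote into the same partition, so ignore existing data and
        # give each flush a unique file name instead.
        existing_data_behavior="overwrite_or_ignore",
        basename_template="part-" + uuid.uuid4().hex + "-{i}.parquet",
        filesystem=s3,
    )

Per-batch flushing does mean more, smaller Parquet files per partition; if that becomes a problem you can raise the batch size as far as memory allows, or run a periodic compaction job over the dataset in S3.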