I want to be able to archive my data from Mongo into S3. Currently, what I do is roughly: (1) query all documents in a time window, (2) transform the results into a pyarrow table, and (3) write the table to S3 as a partitioned Parquet dataset.

It works for now, but steps 1 and 2 are kind of a bulk thing: if the result set is huge, it somehow needs to fit into memory. I could process smaller chunks, but I think that would also mean smaller Parquet files, which is not performant. Are there any best practices for this? Here is the current script:
import json

import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs
from motor.motor_asyncio import AsyncIOMotorCollection

# Step 1: pull every matching document into memory in one go.
audits_collection: AsyncIOMotorCollection = db.get_collection("audits")
results = await audits_collection.find({
    "eventOccurredAt": {
        "$gte": from_date,
        "$lt": to_date,
    },
}).to_list(None)  # OK for now, but not scalable if the result set gets too large

# Step 2: round-trip through JSON to coerce values like ObjectId into strings,
# then build a pyarrow table.
transformed = json.loads(json.dumps(results, default=str))
table = pa.Table.from_pylist(transformed, schema=schema)

# Step 3: write the table to S3 as a Parquet dataset partitioned by date and hour.
s3 = fs.S3FileSystem()
pq.write_to_dataset(
    table,
    root_path="bucket-name/archive",
    partition_cols=["date", "hour"],
    existing_data_behavior="delete_matching",
    filesystem=s3,
)
Your approach is fine overall; yes, you will need to batch this to keep memory under control. I would say it's reasonable to self-manage it with scripts similar to yours.
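For the memory vs. file-size trade-off, one pattern worth considering is to stream the cursor in fixed-size chunks and append each chunk as a row group to a single Parquet file: memory stays bounded by the chunk size, while you still end up with one reasonably large file per partition. The sketch below is a rough adaptation of your script, not a drop-in replacement; it assumes db, schema, from_date and to_date exist as in your snippet, that each run covers a single date/hour window, and the path layout plus the archive_window / chunk_size names are made up for illustration.

import json

import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs


async def archive_window(db, schema, from_date, to_date, chunk_size=50_000):
    s3 = fs.S3FileSystem()
    cursor = db.get_collection("audits").find(
        {"eventOccurredAt": {"$gte": from_date, "$lt": to_date}}
    )

    # Hypothetical layout: hive-style partition directories derived from the window start.
    path = f"bucket-name/archive/date={from_date:%Y-%m-%d}/hour={from_date:%H}/part-0.parquet"

    writer = None
    chunk = []
    try:
        async for doc in cursor:
            chunk.append(doc)
            if len(chunk) >= chunk_size:
                writer = _write_chunk(writer, chunk, schema, path, s3)
                chunk = []
        if chunk:
            writer = _write_chunk(writer, chunk, schema, path, s3)
    finally:
        if writer is not None:
            writer.close()


def _write_chunk(writer, chunk, schema, path, s3):
    # Same ObjectId/datetime-to-string trick as in your script, applied per chunk.
    records = json.loads(json.dumps(chunk, default=str))
    table = pa.Table.from_pylist(records, schema=schema)
    if writer is None:
        writer = pq.ParquetWriter(path, schema, filesystem=s3)
    writer.write_table(table)  # appends a new row group to the same file
    return writer

If you'd rather keep the partitioned-dataset write, pyarrow.dataset.write_dataset also exposes knobs like min_rows_per_group, max_rows_per_group and max_rows_per_file that help avoid ending up with lots of tiny files.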
At some point, when your scale grows, I'd recommend using a managed service. For example, AWS offers DMS (Database Migration Service), which can connect to a MongoDB instance and export data into S3. It is more limited when it comes to restructuring the data, but you can add additional scripts that do that over S3 afterwards.
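To give a rough idea of what that looks like, here is a minimal boto3 sketch of kicking off a full-load DMS task from a MongoDB source endpoint to an S3 target endpoint. The endpoint and replication-instance ARNs, the database name "mydb", and the task identifier are placeholders; you would create the endpoints and the replication instance beforehand (console, CLI, or IaC).

import json

import boto3

dms = boto3.client("dms")

# Only replicate the "audits" collection from the "mydb" database
# (for MongoDB sources, "schema-name" is the database name).
table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-audits",
            "object-locator": {"schema-name": "mydb", "table-name": "audits"},
            "rule-action": "include",
        }
    ]
}

dms.create_replication_task(
    ReplicationTaskIdentifier="mongo-audits-to-s3",                  # placeholder name
    SourceEndpointArn="arn:aws:dms:...:endpoint/mongo-source",       # MongoDB endpoint (placeholder ARN)
    TargetEndpointArn="arn:aws:dms:...:endpoint/s3-target",          # S3 endpoint (placeholder ARN)
    ReplicationInstanceArn="arn:aws:dms:...:rep/archive-instance",   # placeholder ARN
    MigrationType="full-load",  # one-off export; "full-load-and-cdc" keeps it in sync
    TableMappings=json.dumps(table_mappings),
)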
Overall this is the "better practice", as they already deal with many of the issues you might encounter. Obviously you may pay slightly more for the service, and it couples you to their cloud even more.