I want to use MinIO/S3 to store data. My data consists of "projects". Each "project" can contain multiple "datasets". A "dataset" contains hundreds to thousands of files. My users are authenticated via LDAP, and I want to be able to assign users read access to these projects. So one user might have read access to multiple projects and all datasets within them. I also have a Postgres database on the side. This database is the "master" of the project/dataset structure, and MinIO should simply mirror it.
My initial plan was to create one bucket for everything, that looks like this:
mybucket/dataset_uid_1
mybucket/dataset_uid_2
mybucket/dataset_uid_3
...
So MinIO/S3 wouldn't even know anything about the project structure. I would simply use my Postgres DB to create a policy for each user that specifically grants them read access to the dataset_uids of the projects they have access to. That sounded perfectly fine to me, but I noticed that S3 (and thus MinIO) typically limits both the length and the number of policies that can be assigned to a user. So effectively, this would cap the number of projects and datasets a user could be granted access to. That seemed unacceptable.
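For illustration, such a per-user policy would look roughly like this (bucket and dataset names are placeholders); the Resource and s3:prefix lists must enumerate every dataset the user may read, so the policy grows with every grant, which is exactly what runs into the size limit:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": [
        "arn:aws:s3:::mybucket/dataset_uid_1/*",
        "arn:aws:s3:::mybucket/dataset_uid_2/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": ["arn:aws:s3:::mybucket"],
      "Condition": {
        "StringLike": {
          "s3:prefix": ["dataset_uid_1/*", "dataset_uid_2/*"]
        }
      }
    }
  ]
}
```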
The next idea was to have this structure in MinIO:
mybucket/project_id_1/dataset_id_1
mybucket/project_id_1/dataset_id_2
mybucket/project_id_2/dataset_id_1
...
Now I would create a group, e.g. "project_id_1_readers", and assign a policy that allows members of that group to read project_id_1, based on the prefix. That sounded reasonable, and as long as a user can be a member of unlimited groups, there would be no limit on the number of projects and datasets. However, I have since noticed that MinIO does not appear (??) to allow creating MinIO-only groups for LDAP users. Is that correct?
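For reference, the per-project policy I had in mind would look roughly like this (names are placeholders) and stays the same size no matter how many datasets the project contains:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": ["arn:aws:s3:::mybucket/project_id_1/*"]
    },
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": ["arn:aws:s3:::mybucket"],
      "Condition": {
        "StringLike": {"s3:prefix": ["project_id_1/*"]}
      }
    }
  ]
}
```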
Can anyone suggest a viable and "good/usual" structure for my data in MinIO, along with a way to set up the policies that implement the access structure I want, without running into the limits on policy size or on the number of policies?
I am relatively new to S3 storage so please feel free to suggest other improvements as well.
More information:
Imagine this is some machine learning dataset containing training/test image data (it's not quite that, but very similar). I imagine users accessing the data via rclone or something similar after obtaining (potentially temporary) credentials for MinIO/S3; otherwise we would have to write our own downloader that requests all pre-signed URLs from node.

Let's take a look at the ways that people can get access to objects in MinIO/S3:
1. Public bucket: the bucket makes everything publicly accessible. Not applicable for your situation.
2. Individual credentials: each user has their own AWS user and associated credentials. Fine for your company's staff, but not appropriate for application users.
3. Temporary credentials: the application could use the AWS Security Token Service (STS) to request temporary credentials carrying a limited set of permissions (GetFederationToken accepts an inline policy for this; GetSessionToken simply inherits the caller's permissions). The credentials are valid for 12 hours by default and can be used to call S3 APIs (e.g. using the AWS CLI). This is a great way to perform bulk operations such as uploading/downloading lots of files, but it assumes that the users are running some software on their end that can use these credentials. Alternatively, you could use Amazon Cognito, an identity platform for web and mobile apps: once users authenticate, they are provided with temporary credentials that let them directly access AWS services.
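A rough sketch of option 3 using boto3 against AWS STS (MinIO ships its own STS variants such as AssumeRole, so the exact call against MinIO may differ; bucket, prefix, and session names are placeholders):

```python
import json
import boto3

# Illustrative inline policy restricting the temporary credentials
# to a single project prefix.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": [
            "arn:aws:s3:::mybucket",
            "arn:aws:s3:::mybucket/project_id_1/*",
        ],
    }],
}

sts = boto3.client("sts")
resp = sts.get_federation_token(
    Name="project-1-reader",      # illustrative session name
    Policy=json.dumps(policy),
    DurationSeconds=12 * 3600,    # 12 hours
)
creds = resp["Credentials"]

# The user can plug these into the AWS CLI, rclone, or an SDK client:
s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
```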
4. Application proxy: all requests are made to your application. The application verifies whether the user is permitted to access an object and then returns the object to the user, so all interaction takes place between the user and your application. The downside is that a lot of traffic will go through your application, so it needs to scale up to handle the load.
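A minimal sketch of option 4 (Flask and boto3 are my choice here; the endpoint, bucket, and permission-check helper are placeholders):

```python
import boto3
from flask import Flask, Response, abort

app = Flask(__name__)

# Internal client holding the application's own MinIO credentials.
s3 = boto3.client("s3", endpoint_url="http://minio:9000")

def user_may_read(user: str, key: str) -> bool:
    # Hypothetical helper: in reality this would consult the Postgres
    # project/dataset permission tables.
    return True

@app.route("/objects/<path:key>")
def get_object(key: str):
    user = "alice"  # in reality: taken from the authenticated LDAP session
    if not user_may_read(user, key):
        abort(403)
    obj = s3.get_object(Bucket="mybucket", Key=key)
    # Stream the object body through the application to the user.
    return Response(obj["Body"].iter_chunks(), content_type=obj["ContentType"])
```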
5. Pre-signed URLs: again, your application is responsible for determining whether the user is permitted to access an object. If so, it returns a pre-signed URL that allows the user to download the private object directly from S3, which removes the load from the application. Generating a pre-signed URL does not require communication with AWS/MinIO, but it does need access to the Secret Key associated with your AWS/MinIO credentials so that it can 'sign' the request. Thus, it isn't a good idea to generate pre-signed URLs in the front-end, because the Secret Key could be exposed.
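For example, generating a pre-signed GET URL on the server side with boto3 (endpoint, bucket, and key names are placeholders) is a purely local signing operation:

```python
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://minio:9000",    # assumed internal MinIO endpoint
    aws_access_key_id="app-access-key",  # kept on the server only
    aws_secret_access_key="app-secret-key",
)

# No network call happens here; the URL is computed from the Secret Key.
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "mybucket", "Key": "project_id_1/dataset_id_1/img_0001.png"},
    ExpiresIn=3600,  # link valid for one hour
)
print(url)
```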
You have concerns about generating hundreds/thousands of pre-signed URLs, but I would be more concerned about determining how users will actually request those objects: would they be individual calls to your application (#4 above), or calls to S3? That is the harder architectural decision.
The benefit of pre-signed URLs is that they behave just like normal URLs and can be accessed via browser or any internet-connected app.
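If you do go this route for whole datasets, here is a sketch of bulk generation that lists one dataset prefix and pre-signs each object (the listing calls MinIO, but each signature is computed locally; names are placeholders):

```python
import boto3

s3 = boto3.client("s3", endpoint_url="http://minio:9000")  # placeholder endpoint

# Pre-sign every object under one dataset prefix.
urls = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="mybucket", Prefix="project_id_1/dataset_id_1/"):
    for obj in page.get("Contents", []):
        urls.append(s3.generate_presigned_url(
            "get_object",
            Params={"Bucket": "mybucket", "Key": obj["Key"]},
            ExpiresIn=3600,
        ))
```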