Tags: amazon-web-services, amazon-s3, minio

Structure for data storage in MinIO / S3


I want to use MinIO/S3 to store data. My data consists of "projects". Each "project" can contain multiple "datasets", and a "dataset" contains hundreds to thousands of files. My users are authenticated via LDAP, and I want to be able to grant users read access to these projects, so one user might have read access to multiple projects and all datasets within them. I also have a Postgres database alongside. This database is the "master" of the project/dataset structure, and MinIO should simply mirror it.

My initial plan was to create one bucket for everything, structured like this:

mybucket/dataset_uid_1
mybucket/dataset_uid_2
mybucket/dataset_uid_3
...

So MinIO/S3 wouldn't even know anything about the project structure. I would simply use my Postgres DB to create a policy for each user that specifically grants read access to the dataset_uids of the projects they have access to. That sounded perfectly fine to me, but I noticed that S3 (and thus MinIO) typically limits both the length and the number of policies that can be assigned to a user. So effectively, this would limit the number of projects and datasets a user could be granted access to. That sounded unacceptable.
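
For illustration, here is a rough sketch (the helper function is hypothetical; the bucket and dataset names are taken from above) of the kind of per-user policy the Node server would have to generate from Postgres. The Resource list grows with every dataset the user is granted, which is exactly what runs into the policy-size limits:

```typescript
// Hypothetical sketch: build a per-user read policy from the dataset UIDs
// stored in Postgres. The Resource array grows with each granted dataset.
function buildUserReadPolicy(datasetUids: string[]) {
  return {
    Version: "2012-10-17",
    Statement: [
      {
        Effect: "Allow",
        Action: ["s3:GetObject"],
        Resource: datasetUids.map((uid) => `arn:aws:s3:::mybucket/${uid}/*`),
      },
    ],
  };
}

// e.g. buildUserReadPolicy(["dataset_uid_1", "dataset_uid_2", "dataset_uid_3"])
```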

The next idea was to have this structure in MinIO:

mybucket/project_id_1/dataset_id_1
mybucket/project_id_1/dataset_id_2
mybucket/project_id_2/dataset_id_1
...

Now I would create a group "project_id_1_readers", for example, and assign a policy that allows members of that group to read project_id_1, based on the prefix. That sounded reasonable, and as long as a user can be a member of an unlimited number of groups, there would be no limit on the number of projects and datasets. However, I noticed that MinIO does not appear (??) to allow creating MinIO-only groups for LDAP users. Is that correct?
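
The appeal of this variant is that each project needs only one fixed-size policy, no matter how many datasets it contains. A hypothetical sketch of such a per-project policy (ListBucket and its prefix condition omitted for brevity):

```typescript
// Hypothetical sketch: one fixed-size read policy per project, to be attached
// to a group such as "project_id_1_readers". Access follows the prefix, so the
// policy does not grow as datasets are added.
const projectReadPolicy = (projectId: string) => ({
  Version: "2012-10-17",
  Statement: [
    {
      Effect: "Allow",
      Action: ["s3:GetObject"],
      Resource: [`arn:aws:s3:::mybucket/${projectId}/*`],
    },
  ],
});
```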

Can anyone suggest a viable and "good/usual" structure for my data in MinIO, along with a way to set up the policies that define the access structure I want, without running into the limits on policy size or on the number of policies?

I am relatively new to S3 storage so please feel free to suggest other improvements as well.

More information:

Imagine this is some machine learning dataset containing training/test image data (it's not quite that, but very similar). Here is how I imagine users accessing it:

  1. User wants to upload/create a new dataset: A Node.js server interacting with MinIO/S3 + Postgres creates the dataset in the SQL database and will (somehow) enable the user to upload their data to the prefix mybucket/project_id/new_dataset_id/. This upload can occur via a React web interface (drag & drop a folder) or a dedicated uploader application (that is yet to be developed). I have had this working with pre-signed URLs already, but it feels weird that, to upload a file, the web frontend needs to contact Node to obtain a pre-signed URL for every single file (a sketch of that pre-signed flow follows this list). So I thought it would be nicer if the web frontend somehow obtained credentials to upload directly to MinIO without pre-signed URLs.
  2. User wants to download a dataset at some point. I imagined we would write a downloader application (might be some desktop CLI tool or something). In theory, this could work with pre-signed URLs as well. It is possible, but it does feel weird that we cannot simply use rclone or something similar after obtaining (potentially temporary) credentials for MinIO/S3, and instead have to write our own downloader that requests all pre-signed URLs from Node.
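
For reference, here is a minimal sketch of the pre-signed upload flow mentioned in point 1, using the AWS SDK for JavaScript v3 against a MinIO endpoint (the endpoint, credentials, and function name are placeholders):

```typescript
// Minimal sketch: the Node backend issues a pre-signed PUT URL after checking
// in Postgres that the user may write to this dataset. Endpoint, credentials
// and names below are placeholders.
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";
import { getSignedUrl } from "@aws-sdk/s3-request-presigner";

const s3 = new S3Client({
  endpoint: "https://minio.example.com", // hypothetical MinIO endpoint
  region: "us-east-1",
  forcePathStyle: true, // path-style addressing is the usual choice for MinIO
  credentials: { accessKeyId: "APP_KEY", secretAccessKey: "APP_SECRET" },
});

export async function presignUpload(projectId: string, datasetId: string, fileName: string) {
  const command = new PutObjectCommand({
    Bucket: "mybucket",
    Key: `${projectId}/${datasetId}/${fileName}`,
  });
  // The browser (or uploader app) then PUTs the file body to this URL.
  return getSignedUrl(s3, command, { expiresIn: 15 * 60 }); // valid for 15 minutes
}
```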

Solution

  • Let's take a look at the ways that people can get access to objects in MinIO/S3:

    1. Public Access

    The bucket makes everything accessible publicly. Not applicable for your situation.

    2. Provide AWS permanent credentials

    This is where each user has their own AWS User and associated credentials. Fine for your company's staff, but not appropriate for application users.

    3. Temporary AWS credentials

    The application could use Amazon Security Token Service (STS) to call GetSessionToken with a limited set of permissions. This would return a set of temporary credentials, valid for up to 12 hours, that can be used to call S3 APIs (e.g. using the AWS CLI). This is a great way to perform bulk operations such as uploading/downloading lots of files. However, it assumes that the users are running some software on their end that can use these credentials.

    Or you could use Amazon Cognito, which is an identity platform for web and mobile apps. Once users authenticate, they are provided with temporary credentials that let them directly access AWS services.
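
    As a rough, hedged sketch of what the temporary-credential flow could look like from Node (assuming an STS-compatible endpoint; with MinIO and LDAP users the equivalent mechanism is typically MinIO's own STS API, e.g. AssumeRoleWithLDAPIdentity, so treat the call below as illustrative only):

    ```typescript
    // Sketch: obtain temporary credentials via STS and hand them to a client-side
    // tool (CLI, rclone, a desktop downloader, ...). Endpoint/keys are placeholders.
    import { STSClient, GetSessionTokenCommand } from "@aws-sdk/client-sts";
    import { S3Client, ListObjectsV2Command } from "@aws-sdk/client-s3";

    const sts = new STSClient({
      endpoint: "https://minio.example.com", // hypothetical STS endpoint
      region: "us-east-1",
      credentials: { accessKeyId: "APP_KEY", secretAccessKey: "APP_SECRET" },
    });

    async function demo() {
      const { Credentials } = await sts.send(
        new GetSessionTokenCommand({ DurationSeconds: 3600 }) // 1 hour
      );
      if (!Credentials) throw new Error("STS returned no credentials");

      // The temporary credentials can be used by any S3 client,
      // e.g. to list a project prefix directly.
      const s3 = new S3Client({
        endpoint: "https://minio.example.com",
        region: "us-east-1",
        forcePathStyle: true,
        credentials: {
          accessKeyId: Credentials.AccessKeyId!,
          secretAccessKey: Credentials.SecretAccessKey!,
          sessionToken: Credentials.SessionToken!,
        },
      });
      const listing = await s3.send(
        new ListObjectsV2Command({ Bucket: "mybucket", Prefix: "project_id_1/" })
      );
      console.log(listing.Contents?.map((o) => o.Key));
    }
    ```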

    4. No AWS/MinIO credentials

    All requests are made to your application. The application verifies whether the user is permitted to access an object and then returns the object to the user. All interaction takes place between:

    • The user and your application, and
    • Your application and S3

    The downside is that a lot of traffic will go through your application, so it needs to scale up to handle the load.
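
    A minimal sketch of this proxy pattern in the Node backend (the permission check, route, and endpoint are hypothetical):

    ```typescript
    // Sketch of option 4: every download passes through the Node application,
    // which checks permissions in Postgres and streams the object to the client.
    import express from "express";
    import { Readable } from "node:stream";
    import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";

    const s3 = new S3Client({
      endpoint: "https://minio.example.com", // hypothetical endpoint
      region: "us-east-1",
      forcePathStyle: true,
      credentials: { accessKeyId: "APP_KEY", secretAccessKey: "APP_SECRET" },
    });

    // Hypothetical permission check against the Postgres "master" of the structure.
    async function userMayRead(userId: string, projectId: string): Promise<boolean> {
      return true; // replace with a real query
    }

    const app = express();

    app.get("/files/:projectId/:datasetId/:file", async (req, res) => {
      const { projectId, datasetId, file } = req.params;
      if (!(await userMayRead("some-user", projectId))) {
        res.sendStatus(403);
        return;
      }
      const obj = await s3.send(
        new GetObjectCommand({ Bucket: "mybucket", Key: `${projectId}/${datasetId}/${file}` })
      );
      // All object bytes flow through this process, hence the scaling concern.
      (obj.Body as Readable).pipe(res);
    });

    app.listen(3000);
    ```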

    5. Use pre-signed URLs

    Again, your application is responsible for determining whether the user is permitted to access an object. If so, it returns a pre-signed URL that allows the user to download a private object directly from S3. This removes load from the application.

    Generating a pre-signed URL does not require communication with AWS/MinIO, but it does need access to the Secret Key associated with your AWS/MinIO credentials so that it can 'sign' the request. Thus, it isn't a good idea to generate pre-signed URLs in the front-end because the Secret Key could be exposed.

    You have concerns about generating hundreds/thousands of pre-signed URLs but I would be more concerned about determining how users will actually request those objects -- would they be individual calls to your application (#4 above), or calls to S3? That is the harder architectural decision.

    The benefit of pre-signed URLs is that they behave just like normal URLs and can be accessed via browser or any internet-connected app.
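
    For completeness, a hedged sketch of how the Node backend could hand out pre-signed download URLs for a whole dataset in one response (endpoint and names are placeholders; pagination of the listing is omitted for brevity):

    ```typescript
    // Sketch of option 5 for downloads: list a dataset prefix and return one
    // pre-signed GET URL per object. Names and endpoint are placeholders.
    import { S3Client, ListObjectsV2Command, GetObjectCommand } from "@aws-sdk/client-s3";
    import { getSignedUrl } from "@aws-sdk/s3-request-presigner";

    const s3 = new S3Client({
      endpoint: "https://minio.example.com",
      region: "us-east-1",
      forcePathStyle: true,
      credentials: { accessKeyId: "APP_KEY", secretAccessKey: "APP_SECRET" },
    });

    export async function presignDatasetDownload(projectId: string, datasetId: string) {
      const prefix = `${projectId}/${datasetId}/`;
      // Note: ListObjectsV2 returns at most 1000 keys per call; pagination omitted here.
      const listing = await s3.send(
        new ListObjectsV2Command({ Bucket: "mybucket", Prefix: prefix })
      );
      return Promise.all(
        (listing.Contents ?? []).map(async (obj) => ({
          key: obj.Key!,
          url: await getSignedUrl(s3, new GetObjectCommand({ Bucket: "mybucket", Key: obj.Key }), {
            expiresIn: 60 * 60, // 1 hour
          }),
        }))
      );
    }
    ```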