Tags: metadata, databricks, aws-glue, aws-glue-spark, aws-databricks

AWS Glue: deployment model in an AWS environment


In our AWS environment, we have two different types of SAGs (Service Account Groups) for data storage. One SAG is for generic storage; the other is for secure data and will only hold PII or otherwise restricted data. We are planning to deploy Glue in this environment. In that case, would we have one metastore covering both secure and non-secure data? If we needed two metastores, how would that work with Databricks? And if we had one metastore, how would we handle the secure data? Please give us more details on this.


Solution

    1. If you are using a single region with one AWS account, there will be only one metastore for both secure and generic data, and you will have to control access with fine-grained IAM access policies (a sketch follows this list).
    2. A better approach would be to use either two different regions in a single AWS account, or two different AWS accounts, so that access is easily managed through two separate metastores.
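
    As an illustration of the fine-grained policies in (1), here is a minimal sketch of a resource-level Glue IAM policy built in Python. The region, account ID, and database name generic_db are hypothetical placeholders; the secure database is simply omitted from the Resource list, so access to it is denied by default.

        import json

        # Placeholders: swap in your real region, account ID, and database name.
        REGION = "us-east-1"
        ACCOUNT_ID = "111122223333"

        def glue_arn(resource: str) -> str:
            """Build a Glue Data Catalog ARN for this account and region."""
            return f"arn:aws:glue:{REGION}:{ACCOUNT_ID}:{resource}"

        # Read-only access to the generic database; the secure database is
        # simply absent from the Resource list, so it is denied by default.
        generic_only_policy = {
            "Version": "2012-10-17",
            "Statement": [{
                "Effect": "Allow",
                "Action": [
                    "glue:GetDatabase",
                    "glue:GetDatabases",
                    "glue:GetTable",
                    "glue:GetTables",
                    "glue:GetPartitions",
                ],
                "Resource": [
                    glue_arn("catalog"),             # catalog-level access is required
                    glue_arn("database/generic_db"),
                    glue_arn("table/generic_db/*"),
                ],
            }],
        }

        print(json.dumps(generic_only_policy, indent=2))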

    To integrate your metastore with Databricks for (1), you will have to create two Glue Catalog instance profiles with resource-level access: one instance profile will have access to the generic databases and tables, while the other will have access to the secure databases and tables (see the boto3 sketch below).
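
    A minimal boto3 sketch of wiring such a policy into an instance profile might look like the following. The profile name glue-generic-access, the account ID, and the abbreviated policy body are hypothetical; a second call with a policy that also lists the secure database ARNs would produce the secure profile.

        import json
        import boto3

        iam = boto3.client("iam")

        # Trust policy so EC2 nodes launched by Databricks can assume the role.
        TRUST_POLICY = json.dumps({
            "Version": "2012-10-17",
            "Statement": [{
                "Effect": "Allow",
                "Principal": {"Service": "ec2.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }],
        })

        def make_glue_instance_profile(name: str, policy_document: dict) -> str:
            """Create a role plus instance profile pair; return the profile ARN."""
            iam.create_role(RoleName=name, AssumeRolePolicyDocument=TRUST_POLICY)
            iam.put_role_policy(RoleName=name, PolicyName=f"{name}-glue",
                                PolicyDocument=json.dumps(policy_document))
            profile = iam.create_instance_profile(InstanceProfileName=name)
            iam.add_role_to_instance_profile(InstanceProfileName=name, RoleName=name)
            return profile["InstanceProfile"]["Arn"]

        # Abbreviated version of generic_only_policy from the previous sketch.
        generic_only_policy = {
            "Version": "2012-10-17",
            "Statement": [{
                "Effect": "Allow",
                "Action": ["glue:GetDatabase", "glue:GetTable", "glue:GetTables"],
                "Resource": [
                    "arn:aws:glue:us-east-1:111122223333:catalog",
                    "arn:aws:glue:us-east-1:111122223333:database/generic_db",
                    "arn:aws:glue:us-east-1:111122223333:table/generic_db/*",
                ],
            }],
        }

        print(make_glue_instance_profile("glue-generic-access", generic_only_policy))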

    To integrate your metastores with Databricks for (2), you will simply create two Glue Catalog instance profiles, each with access to its respective metastore (see the cluster configuration sketch below).
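
    For either option, the instance profile is attached to a Databricks cluster, and the Glue Catalog is enabled through Spark configuration. Below is a sketch using the Databricks Clusters REST API; the workspace URL, token, node type, Runtime version, and instance profile ARN are all placeholders, and spark.hadoop.hive.metastore.glue.catalogid is only needed when the catalog lives in a different AWS account than the cluster.

        import requests

        # Placeholders: workspace URL and personal access token.
        HOST = "https://<your-workspace>.cloud.databricks.com"
        TOKEN = "dapi-placeholder-token"

        cluster_spec = {
            "cluster_name": "secure-glue-cluster",
            "spark_version": "13.3.x-scala2.12",
            "node_type_id": "i3.xlarge",
            "num_workers": 2,
            "aws_attributes": {
                # Instance profile scoped to the secure metastore.
                "instance_profile_arn":
                    "arn:aws:iam::444455556666:instance-profile/glue-secure-access",
            },
            "spark_conf": {
                # Use the Glue Data Catalog as the metastore instead of the
                # default Hive metastore.
                "spark.databricks.hive.metastore.glueCatalog.enabled": "true",
                # Only needed when the catalog lives in another AWS account:
                # point at that account's 12-digit catalog ID.
                "spark.hadoop.hive.metastore.glue.catalogid": "444455556666",
            },
        }

        resp = requests.post(f"{HOST}/api/2.0/clusters/create",
                             headers={"Authorization": f"Bearer {TOKEN}"},
                             json=cluster_spec)
        resp.raise_for_status()
        print(resp.json()["cluster_id"])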

    It is recommended to go with the second option, as it will save you a lot of maintenance cost and human error in the long run. See the Databricks documentation for more details on Glue Catalog and Databricks integration.

    Edit: Based on the discussion in the comments, if both datasets have to be accessed inside the same Databricks Runtime, option (2) won't work. Option (1) can be used with two permission sets: the first for generic data only, and the second for both generic and secure data (sketched below).
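
    To make the two permission sets concrete, the following sketch (with the same placeholder account, region, and database names as above) shows that the second set is just a superset of the first:

        # Hypothetical resource lists for the two permission sets in option (1);
        # the account/region prefix and database names are placeholders.
        PREFIX = "arn:aws:glue:us-east-1:111122223333"

        # Permission set 1: generic data only.
        generic_resources = [
            f"{PREFIX}:catalog",
            f"{PREFIX}:database/generic_db",
            f"{PREFIX}:table/generic_db/*",
        ]

        # Permission set 2: a strict superset covering generic plus secure data.
        generic_and_secure_resources = generic_resources + [
            f"{PREFIX}:database/secure_db",
            f"{PREFIX}:table/secure_db/*",
        ]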