encryption, bigdata, provisioning, data-lake

How to design a data provisioning strategy for a big data system?


I'm designing the data provisioning module of a big data system. Data provisioning is described as

The process of providing the data from the Data Lake to downstream systems is referred to as Data Provisioning; it provides data consumers with secure access to the data assets in the Data Lake and allows them to source this data. Data delivery, access, and egress are all synonyms of Data Provisioning and can be used in this context.

in Data Lake Development with Big Data. I'm looking for standards for designing this module, including how to secure the data, how to identify which data came from the system, etc. I have searched on Google, but there are not many results for that keyword. Can you share some advice or your own experience with this problem? Every answer is appreciated.
Thank you!


Solution

  • Data provisioning is mainly done by creating different data marts for your downstream consumers. For example, if you have a big data system with data from various sources aggregated into one data lake, you can create separate data marts such as 'Purchase', 'Sales', and 'Inventory', and let downstream systems consume those. A downstream consumer that needs only 'Inventory' data then consumes only the 'Inventory' data mart.

    Your best bet is to search for 'Data Marts'; see, for example, https://panoply.io/data-warehouse-guide/data-mart-vs-data-warehouse/. A rough sketch of the approach is shown below.
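
    A minimal PySpark sketch of this idea; the table, database, and column names (datalake.transactions, mart_sales, record_type, etc.) are assumptions made up for illustration, not part of any standard:

```python
from pyspark.sql import SparkSession

# Hypothetical Spark session against a Hive-backed data lake.
spark = (SparkSession.builder
         .appName("data-provisioning-marts")
         .enableHiveSupport()
         .getOrCreate())

# Aggregated data lake table (name assumed for illustration).
lake = spark.table("datalake.transactions")

# Carve out subject-oriented marts: each exposes only the rows and
# columns its downstream consumers actually need.
sales = (lake.filter(lake.record_type == "sale")
             .select("order_id", "store_id", "amount", "sale_date"))
inventory = (lake.filter(lake.record_type == "stock")
                 .select("sku", "warehouse_id", "quantity", "as_of_date"))

# Publish each mart as its own Hive database/table so consumers never
# read the raw lake directly.
spark.sql("CREATE DATABASE IF NOT EXISTS mart_sales")
spark.sql("CREATE DATABASE IF NOT EXISTS mart_inventory")
sales.write.mode("overwrite").saveAsTable("mart_sales.sales_fact")
inventory.write.mode("overwrite").saveAsTable("mart_inventory.inventory_fact")
```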

    You can then fine-tune security and access control per data mart. For example, make the 'Sales' mart accessible only to sales reporting systems, users, and groups, and tokenize sensitive fields in the 'Purchase' mart. It all comes down to the business requirements; both ideas are sketched below.
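
    A sketch of both controls, continuing the example above. The GRANT assumes Hive's SQL standard-based authorization is enabled (Apache Ranger or Sentry policies are the usual alternatives), and the purchase table and card_number column are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import sha2, col

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Access control: restrict the sales mart to a reporting role. This is
# run in Hive itself (e.g. via beeline), not through Spark:
#   GRANT SELECT ON TABLE mart_sales.sales_fact TO ROLE sales_reporting;

# Tokenization: replace a sensitive field in the purchase mart with an
# irreversible SHA-256 digest before publishing it to consumers.
purchase = spark.table("datalake.purchases")  # hypothetical source table
tokenized = purchase.withColumn("card_number", sha2(col("card_number"), 256))

spark.sql("CREATE DATABASE IF NOT EXISTS mart_purchase")
tokenized.write.mode("overwrite").saveAsTable("mart_purchase.purchase_fact")
```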

    Another way is to export the aggregated data via a data export mechanism, for example using Apache Sqoop to offload data to an RDBMS. This approach is advisable when the data to export is small enough to be handed over to the downstream consumer; see the sketch below.
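
    Sqoop itself is driven from the command line; its rough shape is shown in a comment below, and the same offload can be written in Spark with the built-in JDBC writer. The connection URL, credentials, and table names are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Rough shape of the equivalent Sqoop export (run from a shell, not Python):
#   sqoop export \
#     --connect jdbc:postgresql://reporting-db:5432/reporting \
#     --username reporting_user -P \
#     --table inventory_fact \
#     --export-dir /warehouse/mart_inventory.db/inventory_fact

# The same offload using Spark's JDBC writer.
inventory = spark.table("mart_inventory.inventory_fact")
(inventory.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://reporting-db:5432/reporting")
    .option("dbtable", "inventory_fact")
    .option("user", "reporting_user")
    .option("password", "change-me")
    .mode("overwrite")
    .save())
```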

    Another way is to create separate 'consumer zones' in the same data lake, for example a dedicated Hadoop directory or Hive database per consumer group, as sketched below.
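
    A sketch of the consumer-zone idea: a dedicated HDFS directory per consumer group (optionally exposed as a Hive database rooted at that directory) inside the same lake. All paths and names are made up for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Hypothetical zone for the sales reporting consumers.
zone_path = "hdfs:///datalake/consumer_zones/sales_reporting"
sales = spark.table("mart_sales.sales_fact")

# Option 1: a plain directory in the lake, written as partitioned Parquet.
(sales.write
      .mode("overwrite")
      .partitionBy("sale_date")
      .parquet(zone_path + "/exports/sales_fact"))

# Option 2: a Hive database rooted at the zone, so consumers query it by name.
spark.sql(f"CREATE DATABASE IF NOT EXISTS zone_sales LOCATION '{zone_path}'")
sales.write.mode("overwrite").saveAsTable("zone_sales.sales_fact")
```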