
Multi-tenancy implementation with Apache Kudu


I am implementing a big data system using Apache Kudu. The preliminary requirements are as follows:

  1. Support Multi-tenancy
  2. Front end will use Apache Impala JDBC drivers to access data.
  3. Customers will write Spark Jobs on Kudu for analytical use cases.

Since Kudu does not support multi-tenancy out of the box, I can think of the following way to support it.

Way:

Each table will have a tenantID column, and data from all tenants will be stored in the same table with the corresponding tenantID.

Map the Kudu tables as external tables in Impala, then create a view over each table with a WHERE clause for each tenant, like:

CREATE VIEW IF NOT EXISTS cust1.table AS SELECT * FROM table WHERE tenantid = 'cust1';

Customer1 will access cust1.table for cust1's data using the Impala JDBC drivers or from Spark; Customer2 will access cust2.table for cust2's data, and so on.
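To make the shared-table approach concrete, here is a sketch of the DDL involved. The table name, columns other than tenantid, and partitioning are illustrative assumptions; the Kudu-backed table syntax assumes Impala's Kudu integration:

```sql
-- Shared Kudu-backed table holding all tenants' rows
-- (table/column names other than tenantid are illustrative).
CREATE TABLE shared_db.events (
  id BIGINT,
  tenantid STRING,
  payload STRING,
  PRIMARY KEY (id, tenantid)
)
PARTITION BY HASH (id) PARTITIONS 4
STORED AS KUDU;

-- One view per tenant, each filtering on its own tenantid.
CREATE VIEW IF NOT EXISTS cust1.events AS
  SELECT * FROM shared_db.events WHERE tenantid = 'cust1';
CREATE VIEW IF NOT EXISTS cust2.events AS
  SELECT * FROM shared_db.events WHERE tenantid = 'cust2';
```

Each tenant then only ever queries its own view, and the WHERE clause keeps other tenants' rows out of the result, provided the tenant cannot bypass the view and hit the base table directly.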

Questions:

  1. Is this an acceptable way to implement multi-tenancy, or is there a better way to do it (maybe with other external services)?
  2. If implemented this way, how do I restrict Customer2 from accessing cust1.table in Kudu, especially when customers will write their own Spark jobs for analytical purposes?

Solution

  • We had a meeting with Cloudera folks, and the following is the response we received to the questions I posted above.

    Questions:

    1. Is this an acceptable way to implement multi-tenancy, or is there a better way to do it (maybe with other external services)?
    2. If implemented this way, how do I restrict Customer2 from accessing cust1.table in Kudu, especially when customers will write their own Spark jobs for analytical purposes?

    Answers:

    1. As pointed out by Samson in the comments, Kudu currently has an all-or-nothing access policy: a client gets either no access or full access. The suggested option is therefore to use Impala as the access layer in front of Kudu.

      Therefore, instead of each table having a tenantID column, each tenant's tables are created separately. These Kudu tables are mapped in Impala as external tables (preferably in separate Impala databases).

      Access to these tables is then controlled using Sentry authorization in Impala.
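      Under this per-tenant-table design, the mapping and authorization steps might look like the following sketch. Database, table, role, and group names are assumptions for illustration; the statements use Impala's external Kudu table mapping and Sentry GRANT syntax:

```sql
-- Map cust1's Kudu table as an external table in its own Impala database
-- (names are illustrative; 'cust1_events' is the assumed Kudu table name).
CREATE DATABASE IF NOT EXISTS cust1;
CREATE EXTERNAL TABLE cust1.events
STORED AS KUDU
TBLPROPERTIES ('kudu.table_name' = 'cust1_events');

-- Restrict access with Sentry: cust1's role may read only its own database.
CREATE ROLE cust1_role;
GRANT SELECT ON DATABASE cust1 TO ROLE cust1_role;
GRANT ROLE cust1_role TO GROUP cust1_group;
```

      With this in place, a user in cust1_group can SELECT from cust1.events via Impala, while a query against another tenant's database is rejected by Sentry.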

    2. For Spark SQL access as well, the suggested approach was to make only the Impala tables visible and not to access the Kudu tables directly. Authentication and authorization are then handled at the Impala level before Spark jobs are given access to the underlying Kudu tables.