I am implementing a big data system using Apache Kudu. The preliminary requirements are as follows:
Since Kudu does not support multi-tenancy out of the box, I can think of the following way to support it.
Each table will have a tenantID column, and data from all tenants will be stored in the same table, tagged with the corresponding tenantID.
Map the Kudu tables as external tables in Impala. Create views on these tables with a WHERE clause for each tenant, like
CREATE VIEW IF NOT EXISTS cust1.table AS SELECT * FROM table WHERE tenantid = 'cust1';
Customer1 will query cust1.table to access cust1's data, using the Impala JDBC driver or from Spark. Customer2 will query cust2.table to access cust2's data, and so on.
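To make the idea concrete, here is a rough sketch of the shared-table setup. The names `shared`, `events`, and `cust2` are made up for illustration (I used `events` instead of `table` because `table` is a reserved word), and the `STORED AS KUDU` form assumes a reasonably recent Impala release with Kudu support:

```sql
-- Hypothetical names: Kudu table 'events', Impala databases 'shared', 'cust1', 'cust2'.
CREATE DATABASE IF NOT EXISTS shared;
CREATE DATABASE IF NOT EXISTS cust1;
CREATE DATABASE IF NOT EXISTS cust2;

-- Map the existing Kudu table into Impala as an external table.
CREATE EXTERNAL TABLE IF NOT EXISTS shared.events
STORED AS KUDU
TBLPROPERTIES ('kudu.table_name' = 'events');

-- One filtered view per tenant, each in that tenant's own database.
CREATE VIEW IF NOT EXISTS cust1.events AS
SELECT * FROM shared.events WHERE tenantid = 'cust1';

CREATE VIEW IF NOT EXISTS cust2.events AS
SELECT * FROM shared.events WHERE tenantid = 'cust2';
```

Each tenant would then be pointed only at its own database and never at `shared`.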
We had a meeting with the Cloudera folks, and the following is the response we received to the questions I posted above.
As pointed out by Samson in the comments, Kudu currently offers only an all-or-nothing access policy: a client has either no access or full access. Therefore the suggested option is to access Kudu through Impala.
So instead of every table having a TenantID column, each tenant's tables are created separately. These Kudu tables are mapped in Impala as external tables (preferably in a separate Impala database per tenant).
Access to these tables is then controlled using Sentry authorization in Impala.
For Spark SQL access as well, the suggested approach was to expose only the Impala tables and not access the Kudu tables directly. Authentication and authorization are then handled at the Impala level before Spark jobs are given access to the underlying Kudu tables.
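As a rough sketch of what this could look like for one tenant (my own illustration, not something Cloudera provided): the database `cust1`, the table `orders`, the Kudu table name `cust1_orders`, the role `cust1_role`, and the OS/LDAP group `cust1_group` are all hypothetical names, and the statements assume Sentry is enabled for Impala:

```sql
-- Hypothetical names throughout; one database per tenant.
CREATE DATABASE IF NOT EXISTS cust1;

-- Map the tenant's own Kudu table into the tenant's Impala database.
CREATE EXTERNAL TABLE IF NOT EXISTS cust1.orders
STORED AS KUDU
TBLPROPERTIES ('kudu.table_name' = 'cust1_orders');

-- Sentry: create a role for the tenant, bind it to the tenant's group,
-- and allow it to read only the tenant's database.
CREATE ROLE cust1_role;
GRANT ROLE cust1_role TO GROUP cust1_group;
GRANT SELECT ON DATABASE cust1 TO ROLE cust1_role;
```

The same three steps would be repeated for cust2 and every other tenant, so no role ever holds a privilege on another tenant's database.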