sql-server, hadoop, hive, sqoop, cloudera-quickstart-vm

Best practice to load multiple clients' data into Hadoop


We are creating a POC on the Hadoop framework with Cloudera CDH. We want to load data from multiple clients into Hive tables.

As of now, we have a separate database for each client on SQL Server. This infrastructure will remain the same for OLTP; Hadoop will be used for OLAP. We have some primary dimension tables which are the same for each client, and all client databases have exactly the same schema, so these tables contain the same primary key values. Until now this was fine because each client had its own database. Now we are trying to load multiple clients' data into the same data container (Hive tables), which means we will get multiple rows with the same primary key value if we load data directly into Hive from the SQL Server databases through a Sqoop job. I am thinking of using a surrogate key in the Hive tables, but Hive does not support auto-increment; it can be achieved with a UDF.
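
For example, I am considering tagging every row with its client during import, so that (client_id, customer_id) stays unique once all clients land in one Hive table. A rough sketch of the Sqoop free-form query import I have in mind (the connection string, database, table and column names are placeholders, not our real schema):

    # Placeholder names throughout; one such job would run per client database,
    # with only the connection string and the literal client id changing.
    sqoop import \
      --connect "jdbc:sqlserver://sqlhost:1433;databaseName=client_a_db" \
      --username etl_user -P \
      --query "SELECT customer_id, customer_name, 'client_a' AS client_id \
               FROM dbo.customer WHERE \$CONDITIONS" \
      --split-by customer_id \
      --num-mappers 4 \
      --target-dir /staging/client_a/customer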

We don't want to modify the SQL Server data, as it is live production data.

a. What is the standard/generic way to load multiple clients' data into the Hadoop ecosystem?

b. How can the primary key of a SQL Server table be mapped easily to a Hive table?

c. How can we ensure that one client is never able to see another client's data?

Thanks


Solution

  • @Praveen: Use multiple mappers to cut down the transfer time (downtime) when moving each client's data to the Hadoop servers; the client data keeps its source primary keys in this case. Make the best use of partitions for each client, and within each client partition by date. You have to set up a TDE zone for the HDFS file location before you start the Sqoop import. *TDE: Transparent Data Encryption zone, a best practice for keeping each client's data in a secured zone.
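
A rough sketch of that flow, kept deliberately minimal (the key name, HDFS paths, database and table names are examples only, and it assumes a Hadoop KMS is available for key management):

    # 1. Per-client encryption key and TDE zone, created BEFORE the Sqoop import
    #    so everything written under the client's directory is encrypted at rest.
    hadoop key create client_a_key
    hdfs dfs -mkdir -p /data/clients/client_a
    hdfs crypto -createZone -keyName client_a_key -path /data/clients/client_a

    # 2. Import one client's table with several mappers, landing the files
    #    inside that client's encryption zone, under a per-date directory.
    sqoop import \
      --connect "jdbc:sqlserver://sqlhost:1433;databaseName=client_a_db" \
      --username etl_user -P \
      --table customer \
      --split-by customer_id \
      --num-mappers 4 \
      --target-dir /data/clients/client_a/customer/2024-06-01

    # 3. Expose the imported files as a (client_id, load_date) partition of one
    #    shared external Hive table; Sqoop's default text output is comma-delimited.
    hive -e "
    CREATE EXTERNAL TABLE IF NOT EXISTS dim_customer (
      customer_id   INT,
      customer_name STRING
    )
    PARTITIONED BY (client_id STRING, load_date STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

    ALTER TABLE dim_customer ADD IF NOT EXISTS
      PARTITION (client_id = 'client_a', load_date = '2024-06-01')
      LOCATION '/data/clients/client_a/customer/2024-06-01';
    "

Each client's data then sits under its own encrypted directory, which is also a natural place to apply per-client HDFS permissions so one client can never read another client's files.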