Tags: hadoop, hive, apache-pig, rdbms, bigdata

Handling complex joins in big data


OLTP systems are normalized and contain hundreds of tables, so we may need to join many tables when designing reports. For example, I need to join four or five tables just to get the email addresses of customers (thanks to the telecommunications model we use).

I am trying to design a data warehouse in HDFS (or maybe S3).

To write email addresses and customer IDs into a table in Hive, I need to extract all the data from the five tables into HDFS using the sqoop import command, and then join these tables in Hive. This process may take a long time just to get email information.
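
A rough sketch of this import-then-join approach, assuming a MySQL source and a hypothetical customers/contacts/contact_points schema (the JDBC URL, credentials, and table/column names are all placeholders):

```bash
# Sketch: import each source table into HDFS, then join in Hive.
# JDBC URL, credentials, and table/column names are placeholders.
for TABLE in customers contacts contact_points; do
  sqoop import \
    --connect jdbc:mysql://dbhost/crm \
    --username etl_user -P \
    --table "$TABLE" \
    --target-dir "/warehouse/staging/$TABLE" \
    --num-mappers 4
done

# Join the staged tables in Hive (assuming external tables have been
# defined over the staging directories above).
hive -e "
  SELECT c.customer_id, cp.email_address
  FROM customers c
  JOIN contacts ct ON ct.customer_id = c.customer_id
  JOIN contact_points cp ON cp.contact_id = ct.contact_id
  WHERE cp.type = 'EMAIL';
"
```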

Alternatively, I could import the already-joined data using sqoop's free-form query option. This seems easier and takes less space in HDFS, but then query execution time on the source database becomes the concern.
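
A minimal sketch of the free-form query variant, with the same placeholder schema as above (Sqoop requires a literal $CONDITIONS token in the WHERE clause, and --split-by or a single mapper):

```bash
# Sketch: push the join down to the source database and import only
# the result. Names and connection details are placeholders.
sqoop import \
  --connect jdbc:mysql://dbhost/crm \
  --username etl_user -P \
  --query "SELECT c.customer_id, cp.email_address
           FROM customers c
           JOIN contacts ct ON ct.customer_id = c.customer_id
           JOIN contact_points cp ON cp.contact_id = ct.contact_id
           WHERE cp.type = 'EMAIL' AND \$CONDITIONS" \
  --split-by c.customer_id \
  --target-dir /warehouse/customer_emails
```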

What is your choice in similar situations? Do you have alternative approaches?

Thank you


Solution

  • You should consider denormalising your schema. E.g. import all the tables you might possibly need into Hive with Sqoop, prepare an ETL process that denormalises them into a star schema, and use the new denormalised schema for querying.

    E.g. you should aim for a "customers" table where email is just one of the fields, as sketched below.
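
A hedged sketch of that ETL step in Hive, reusing the placeholder staging tables from the imports above (table and column names are assumptions, not part of the original post):

```bash
# Sketch: denormalise the staged tables into one wide dimension table
# so downstream queries need no joins. All names are placeholders.
hive -e "
  CREATE TABLE customers_dim STORED AS ORC AS
  SELECT c.customer_id,
         c.full_name,
         c.signup_date,
         cp.email_address
  FROM customers c
  LEFT JOIN contacts ct ON ct.customer_id = c.customer_id
  LEFT JOIN contact_points cp ON cp.contact_id = ct.contact_id
                             AND cp.type = 'EMAIL';
"
```

After this one-off ETL run, getting an email address is a single-table scan (SELECT customer_id, email_address FROM customers_dim) instead of a five-way join.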