Search code examples
hadoophivehdfssqoop

Questions about Hive


I have this environment:

  • Haddop environment (1 master, 4 slaves) with several applications: ambari, hue, hive, sqoop, hdfs ... Server in production (separate from hadoop) with mysql database.

My goal is:

  • Optimize the queries made on this mysql server that are slow to execute today.

What did I do:

  • I imported the mysql data to HDFS using Sqoop.

My doubts:

  1. I can not make selects direct in HDFS using Hive?
  2. Do I have to load the data into Hive and make the queries?
  3. If new data is entered into the mysql database, what is the best way to get this data and insert it into HDFS and then insert it into Hive again? (Maybe in real time)

Thank you in advance


Solution

  • I can not make selects direct in HDFS using Hive?

    You can. Create External Table in hive specifying your hdfs location. Then you can perform any HQL over it.

    Do I have to load the data into Hive and make the queries?

    In case of external table, you don't need to load data in hive; your data resides in the same HDFS directory.

    If new data is entered into the mysql database, what is the best way to get this data.

    You can use Sqoop Incremental Import for this. It will fetch only newly added/updated data (depending upon incremental mode). You can create a sqoop job and schedule it as per your need.