Search code examples
hiveapache-spark-sqlhadoop-streaming

how to set up SQL/Hive connection with cloudera cluster to read data stored on cluster


I wanted to retrieve the data stored onto Hadoop Cloudera cluster either via Hive, Spark or SQL. I have SQL query written which should fetch data from the cluster. But prior to that, I want to understand how to set up connection /Cursor with cluster so that it will know where to read from or write to?

sc = spark.sparkContext or similarly HIVECONTEXT or SPARKCONTEXT will not suffice.

We might need to give URL for node and all. So how to do that?

Any Small example would suffice.


Solution

  • There are two ways to create the table in the hive:

    1- Creating an external table schema:

    CREATE EXTERNAL TABLE IF NOT EXISTS names_text(
              student_ID INT, FirstName STRING, LastName STRING,    
              year STRING, Major STRING)
              COMMENT 'Student Names'
              ROW FORMAT DELIMITED
              FIELDS TERMINATED BY ','
              STORED AS TEXTFILE
              LOCATION '/user/andrena';
    

    2- a) Create the schema for a managed table:

    CREATE TABLE IF NOT EXISTS Names(
      student_ID INT, FirstName STRING, LastName STRING,    
      year STRING, Major STRING)
      COMMENT 'Student Names'
      STORED AS ORC;
    

    b) Move the external table data to the managed table:

    INSERT OVERWRITE TABLE Names SELECT * FROM names_text;
    

    And finally, verify that the Hive warehouse stores the student names in the external and internal table respectively :

    SELECT * FROM names_text;
    
    SELECT * from Names;