Tags: scala, apache-spark, apache-spark-sql, hive

Issues with SparkSQL (Spark and Hive connectivity)


I am trying to read data from a Hive database into Spark. Even though the database contains data (I verified this directly in Hive), a query issued from Spark returns no rows (it does return the column information, though).

I copied the hive-site.xml file into Spark's configuration folder, as required for Hive connectivity.

IMPORTS

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql
import org.apache.spark.storage.StorageLevel
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.hive.HiveContext

Creating a Spark session:

val spark = SparkSession.builder()
  .appName("Reto")
  .config("spark.sql.warehouse.dir", "hive_warehouse_hdfs_path")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("show databases").show()

Getting data:

spark.sql("USE retoiabd")
// show() returns Unit, so there is no point binding its result to a val
spark.sql("SELECT count(*) FROM churn").show()

Output:

+--------+
|count(1)|
+--------+
|       0|
+--------+

Solution

  • After checking with our teacher, it turned out the problem was with how the tables themselves were created in Hive.

    We created the table like this:

    CREATE TABLE name (columns)

    Instead of like this:

    CREATE EXTERNAL TABLE name (columns)
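
For context: a plain `CREATE TABLE` produces a *managed* table, so Hive expects to own the data under its warehouse directory; if the files actually live elsewhere on HDFS, Spark can resolve the table's schema from the metastore but finds no rows. An `EXTERNAL` table instead points Hive at the existing files via a `LOCATION` clause. A minimal sketch of the fix (the column names, row format, and HDFS path below are placeholders, not from the original setup):

```sql
-- EXTERNAL: Hive reads the files in place and does not move or own them;
-- DROP TABLE removes only the metadata, not the underlying data.
CREATE EXTERNAL TABLE churn (
    customer_id STRING,
    churned     BOOLEAN
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/path/to/existing/churn/data';  -- placeholder HDFS path
```

After recreating the table this way, the same `spark.sql("SELECT count(*) FROM churn")` query should see the rows that Hive sees.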