Tags: apache-spark, hadoop, hive, hdfs, sqoop

Spark-Hive-Sqoop: Saving data to a Hive table with Spark shows junk characters, and Sqoop export puts everything in one column


I am loading data from a Hive table (hivetable1), making some modifications to it with Spark, and saving the result into another Hive table (hivetable2). When I run select * from hivetable2, it shows the data correctly, but when I look at the underlying file in HDFS it is full of junk characters, as shown below. And when I try to export the same data to Postgres using Sqoop, the entire row ends up appended into a single column of the Postgres table.

Spark script:

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL Hive integration example") \
    .config("hive.metastore.uris", "thrift://localhost:9083") \
    .config("spark.sql.catalogImplementation", "hive") \
    .enableHiveSupport() \
    .getOrCreate()

df = spark.sql("select * from hivetable1")

df.write.format("hive").mode('overwrite').option("delimiter", "\t").saveAsTable("hivetable2")

HDFS file data:

hadoop fs -cat /user/hive/warehouse/tb.db/hivetable2/part-0000


lnullunknownnullnullnull\N\N\N\Nnullnullnullnullnull0.00.0nullnull\Nnull\Nnullnullnullnullnull\Nnullnull\Nnullnullnull\Nnullnullnull\Nnullnull
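The junk characters are most likely Hive's default field delimiter \x01 (Ctrl-A), which terminals render as garbage: the hive write format appears to ignore an option called "delimiter" (Spark's documented options for that format use names like fieldDelim), so the table falls back to Hive's default text SerDe. One quick way to confirm from the same PySpark session; the part-file path below is the one from the cat command above:

raw = spark.sparkContext.textFile("/user/hive/warehouse/tb.db/hivetable2/part-0000")
# Non-printable delimiters show up explicitly in the repr, e.g. \x01 between fields
print(repr(raw.first()))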

Sqoop Export:

sqoop export --connect "jdbc:postgresql://localhost:5432/postgres?stringtype=unspecified" -m 1 \
  --table test --export-dir /user/hive/warehouse/tb.db/hivetable2 \
  --username test --password test --input-fields-terminated-by '\t'

Solution

  • If the file in HDFS is Ctrl-A-delimited (Hive's text default) rather than tab-delimited, Sqoop's --input-fields-terminated-by '\t' never matches a field boundary, so each entire line lands in a single Postgres column. I would suggest creating the Hive table separately, with an explicit row format, and then appending the data into it, as sketched below. Once that is done, you can Sqoop it back to the database.

    df.write.insertInto("database_name.table", overwrite=True)
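
    A minimal sketch of that approach, assuming a database named tb (taken from the warehouse path above) and hypothetical columns col1 and col2; the CREATE TABLE pins the on-disk layout to tab-delimited text, so both hadoop fs -cat and the Sqoop export above see the delimiter they expect:

    # Pre-create the target table with an explicit tab delimiter.
    # Column names/types here are placeholders; adjust to the real schema.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS tb.hivetable2 (
            col1 STRING,
            col2 STRING)
        ROW FORMAT DELIMITED
        FIELDS TERMINATED BY '\\t'
        STORED AS TEXTFILE
    """)

    # insertInto matches columns by position, so the DataFrame's column
    # order must line up with the table definition above.
    df.write.insertInto("tb.hivetable2", overwrite=True)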