apache-spark hive pyspark parquet hortonworks-data-platform

How to create parquet table in Hive 3.1 through Spark 2.3 (pyspark)

I am facing issues while creating and loading a Parquet table from Spark.

Environment details:

Hortonworks HDP 3.0

Spark 2.3.1

Hive 3.1

1#. When I try to create a Parquet table in Hive 3.1 through Spark 2.3, Spark throws the error below.

df.write.format("parquet").mode("overwrite").saveAsTable("database_name.test1")

pyspark.sql.utils.AnalysisException: u'org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Table datamart.test1 failed strict managed table checks due to the following reason: Table is marked as a managed table but is not transactional.);'

2#. I can successfully insert data into an existing Parquet table and read it back through Spark.

df.write.format("parquet").mode("overwrite").insertInto("database_name.test2")

spark.sql("select * from database_name.test2").show()

spark.read.parquet("/path-to-table-dir/part-00000.snappy.parquet").show()

But when I try to read the same table through Hive, the Hive session gets disconnected and throws the error below.

SELECT * FROM database_name.test2

org.apache.thrift.transport.TTransportException
        at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
        at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
        at org.apache.thrift.transport.TSaslTransport.readLength(TSaslTransport.java:376)
        at org.apache.thrift.transport.TSaslTransport.readFrame(TSaslTransport.java:453)
        at org.apache.thrift.transport.TSaslTransport.read(TSaslTransport.java:435)
        at org.apache.thrift.transport.TSaslClientTransport.read(TSaslClientTransport.java:37)
        at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
        at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
        at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
        at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
        at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77)
        at org.apache.hive.service.rpc.thrift.TCLIService$Client.recv_FetchResults(TCLIService.java:567)
        at org.apache.hive.service.rpc.thrift.TCLIService$Client.FetchResults(TCLIService.java:554)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hive.jdbc.HiveConnection$SynchronizedHandler.invoke(HiveConnection.java:1572)
        at com.sun.proxy.$Proxy22.FetchResults(Unknown Source)
        at org.apache.hive.jdbc.HiveQueryResultSet.next(HiveQueryResultSet.java:373)
        at org.apache.hive.beeline.BufferedRows.<init>(BufferedRows.java:56)
        at org.apache.hive.beeline.IncrementalRowsWithNormalization.<init>(IncrementalRowsWithNormalization.java:50)
        at org.apache.hive.beeline.BeeLine.print(BeeLine.java:2250)
        at org.apache.hive.beeline.Commands.executeInternal(Commands.java:1026)
        at org.apache.hive.beeline.Commands.execute(Commands.java:1201)
        at org.apache.hive.beeline.Commands.sql(Commands.java:1130)
        at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:1425)
        at org.apache.hive.beeline.BeeLine.execute(BeeLine.java:1287)
        at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:1071)
        at org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:538)
        at org.apache.hive.beeline.BeeLine.main(BeeLine.java:520)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:318)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:232)
Unknown HS2 problem when communicating with Thrift server.
Error: org.apache.thrift.transport.TTransportException: java.net.SocketException: Broken pipe (Write failed) (state=08S01,code=0)

After this error the Hive session is disconnected and I have to reconnect. All other queries work fine; only this query produces the above error and drops the connection.

Solution

This issue occurred because the Hive tables were accessed without the Hive Warehouse Connector.

By default, Spark uses the Spark catalog. The article below explains how Apache Hive tables can be accessed through Spark.

Integrating Apache Hive with Apache Spark - Hive Warehouse Connector

From HDP 3.0, catalogs for Apache Hive and Apache Spark are separated, and they use their own catalog; namely, they are mutually exclusive - Apache Hive catalog can only be accessed by Apache Hive or this library, and Apache Spark catalog can only be accessed by existing APIs in Apache Spark. In other words, some features such as ACID tables or Apache Ranger with Apache Hive table are only available via this library in Apache Spark. Those tables in Hive should not directly be accessible within Apache Spark APIs themselves.
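
For illustration, here is a minimal sketch of what going through the connector might look like in pyspark. It assumes the Hive Warehouse Connector jar and its pyspark_llap zip are passed to the job (for example via --jars and --py-files) and that the HiveServer2 connection settings for the connector are already configured on the cluster. The helper names (build_hwc_session, write_via_hwc, read_via_hwc) are hypothetical, and the format string and session builder are as I recall them from the connector documentation, so verify them against your HDP version.

```python
# Sketch only: helper names are hypothetical, not part of any library.

# Data source name used by the Hive Warehouse Connector (assumed from the
# connector docs; also exposed as HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR).
HWC_FORMAT = "com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector"


def build_hwc_session(spark):
    """Build a HiveWarehouseSession from an existing SparkSession."""
    # Imported lazily: pyspark_llap only exists where the HWC zip is on the path.
    from pyspark_llap import HiveWarehouseSession
    return HiveWarehouseSession.session(spark).build()


def write_via_hwc(df, table):
    """Write a DataFrame to a Hive table through the connector
    instead of df.write.saveAsTable(...)."""
    df.write.format(HWC_FORMAT).option("table", table).save()


def read_via_hwc(hive, table):
    """Read a Hive table through the connector; returns whatever
    hive.executeQuery returns (a DataFrame on a real cluster)."""
    return hive.executeQuery("SELECT * FROM {}".format(table))


# Usage on a cluster (not executed here):
#   hive = build_hwc_session(spark)
#   write_via_hwc(df, "database_name.test1")
#   read_via_hwc(hive, "database_name.test2").show()
```

The point of the sketch is that both the write and the read go through HiveServer2 via the connector, so Hive stays in charge of its own managed tables rather than Spark writing into them directly.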