I had the Spark-to-HAWQ JDBC connection working, but now, two days later, there's an issue extracting data from a table. Nothing has changed in the Spark configuration...
Simple step #1: printing the schema of a simple table in HAWQ. I can create a SQLContext DataFrame and connect to the HAWQ db:
df = sqlContext.read.format('jdbc').options(url=db_url, dbtable=db_table).load()
df.printSchema()
Which prints:
root
|-- product_no: integer (nullable = true)
|-- name: string (nullable = true)
|-- price: decimal (nullable = true)
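(For reference, db_url and db_table are defined along these lines; the host, port, database, table, and credentials here are placeholders, not the real values:)
# Placeholder values only; the real cluster uses a different host, port,
# database, table, user, and password.
db_url = "jdbc:postgresql://hawq-master.example.com:5432/sampledb?user=gpadmin&password=secret"
db_table = "products"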
But when actually trying to extract data:
df.select("product_no").show()
This error pops up:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost):
org.postgresql.util.PSQLException: ERROR: could not write 3124 bytes to temporary file: No space left on device (buffile.c:408) (seg33 adnpivhdwapda04.gphd.local:40003 pid=544124) (cdbdisp.c:1571)
at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2182)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1911)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:173)
at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:615)
at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:465)
at org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:350)
at org.apache.spark.sql.jdbc.JDBCRDD$$anon$1.<init>(JDBCRDD.scala:372)
at org.apache.spark.sql.jdbc.JDBCRDD.compute(JDBCRDD.scala:350)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:248)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1772)
at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:208)
Things I've tried (but I'm willing to try again if there are more precise steps):
This can't be an actual memory deficiency, so where and what is tripping this up?
could not write 3124 bytes to temporary file: No space left on device
The volume used for temp files is filling up. The temp file is then deleted on error, so you don't actually see the full volume in df.
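If you want to catch it in the act, you can poll free space on the temp volume of the affected segment host (seg33 in your trace) while the query runs. A minimal sketch; the pgsql_tmp path below is a guess and will differ on your cluster:
import os, time

# Hypothetical temp-file directory on the affected segment host; adjust to
# wherever HAWQ actually writes its temporary files there.
temp_dir = "/data/hawq/segment/base/pgsql_tmp"

# Poll free space once a second while the failing query runs, so the
# transient fill-up is visible even though the temp file is removed on error.
while True:
    st = os.statvfs(temp_dir)
    free_gb = st.f_bavail * st.f_frsize / (1024.0 ** 3)
    print("free space under %s: %.2f GiB" % (temp_dir, free_gb))
    time.sleep(1)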
This could be a tmpfs, like /tmp on most Linux systems. If so, it's backed by virtual memory. To confirm, check mount, and check the setting of PostgreSQL's temp_tablespaces (SHOW temp_tablespaces). If it's blank, PostgreSQL will use the default tablespace, which is unlikely to be a tmpfs, but if it's set, check where that tablespace is. If it's on a tmpfs, you may need to move it.
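Since the JDBC connection itself is fine, one way to read that setting without leaving Spark is to push a small query through the same connection. This is only a sketch: it assumes HAWQ exposes PostgreSQL's temp_tablespaces setting and accepts a parenthesised subquery as dbtable; mount and df still have to be run on the segment host itself:
# Check temp_tablespaces through the existing JDBC connection; a subquery
# aliased as a table is passed where a table name would normally go.
q = "(SELECT current_setting('temp_tablespaces') AS temp_tablespaces) AS t"
settings_df = sqlContext.read.format('jdbc').options(url=db_url, dbtable=q).load()
settings_df.show()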
The query could also somehow be filling the main tablespace, but that's exceedingly unlikely if it's only at 1% utilization at the moment. Perhaps a massively runaway recursive CTE could do it, but even that is unlikely.
Quota management is also a possibility. Perhaps a filesystem quota is configured?
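If quotas are a suspect, something along these lines on the segment host would show them. A sketch only: it assumes the quota tools are installed, that quotas are actually enabled on that filesystem, and the gpadmin user name is a guess:
import subprocess

# Run on the affected segment host; both commands only report anything useful
# if filesystem quotas are actually enabled there.
for cmd in (["quota", "-s", "-u", "gpadmin"],  # per-user quota; user name is a guess
            ["repquota", "-a", "-s"]):         # summary of all quotas, typically needs root
    try:
        print(subprocess.check_output(cmd))
    except Exception as e:
        print("%s failed: %s" % (" ".join(cmd), e))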