I am fairly new to PySpark/Hive and I have a problem:
I have a dataframe and want to write it as a partitioned table to HDFS. So far, I've done that via:
spark.sql('''
    CREATE EXTERNAL TABLE database.df(
        ID STRING
    )
    PARTITIONED BY (
        DATA_DATE_PART STRING
    )
    STORED AS PARQUET
    LOCATION 'hdfs://path/file'
''')
df.createOrReplaceTempView("df")
spark.sql('''
    INSERT INTO database.df PARTITION(DATA_DATE_PART = '{}')
    SELECT ID
    FROM df
'''.format(date))
But as the dataframe keeps growing, and I don't want to define every column by hand, I thought there must be a better solution, something like:
df.write.mode('overwrite').partitionBy('DATA_DATE_PART').parquet("/path/file")
However, I cannot access a table written like this via spark.sql, nor do I see it in my HUE browser. I can see the files, though, from the shell: hdfs dfs -ls /path/
So my question is: why is that? I've read that Parquet files can be special when read with SQL, but my first script works fine and the tables are visible everywhere.
You just need to use the saveAsTable function for that (doc). By default it stores data in the default warehouse location, but you can use the path option to redefine it and make the table "unmanaged" (see this doc for more details). Just use the following code:
df.write.mode('overwrite').partitionBy('DATA_DATE_PART') \
    .format("parquet") \
    .option("path", "/path/file") \
    .saveAsTable("database.df")
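Once it's saved this way, the table is registered in the metastore, so you should be able to query it and see it in HUE. As a minimal check (assuming the same database.df table and date variable from your question):

spark.sql('''
    SELECT ID
    FROM database.df
    WHERE DATA_DATE_PART = '{}'
'''.format(date)).show()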