Search code examples
apache-sparkpysparkapache-spark-sqlapache-hive

Save data as text file from spark to hdfs


I processed data using pySpark and sqlContext using the following query:

(sqlContext.sql("select LastUpdate,Count(1) as Count" from temp_t)
           .rdd.coalesce(1).saveAsTextFile("/apps/hive/warehouse/Count"))

It is stored in the following format:

Row(LastUpdate=u'2016-03-14 12:27:55.01', Count=1)
Row(LastUpdate=u'2016-02-18 11:56:54.613', Count=1)
Row(LastUpdate=u'2016-04-13 13:53:32.697', Count=1)
Row(LastUpdate=u'2016-02-22 17:43:37.257', Count=5)

But I want to store the data in a Hive table as

LastUpdate                           Count

2016-03-14 12:27:55.01                   1
.                                        .
.                                        .

Here is how I create the table in Hive:

CREATE TABLE Data_Count(LastUpdate string, Count int )
ROW FORMAT DELIMITED fields terminated by '|';

I tried many options but was not successful. Please help me on this.


Solution

  • Why not load the data into Hive itself, without going through the process of saving the file and then loading it to hive.

    from datetime import datetime, date, time, timedelta
    hiveCtx = HiveContext(sc)
    
    #Create sample data
    currTime = datetime.now()
    currRow = Row(LastUpdate=currTime)
    delta = timedelta(days=1)
    futureTime = currTime + delta
    futureRow = Row(LastUpdate=futureTime)
    lst = [currRow, currRow, futureRow, futureRow, futureRow]
    
    #parallelize the list and convert to dataframe
    myRdd = sc.parallelize(lst)
    df = myRdd.toDF()
    df.registerTempTable("temp_t")
    aggRDD = hiveCtx.sql("select LastUpdate,Count(1) as Count from temp_t group by LastUpdate")
    aggRDD.saveAsTable("Data_Count")