Tags: scala, hive, apache-spark-1.3

Incrementally adding to a Hive table w/Scala + Spark 1.3


Our cluster has Spark 1.3 and Hive. There is a large Hive table that I need to add randomly selected rows to. There is a smaller table that I read and check a condition against; if that condition is true, I grab the variables I need to then query for the random rows to fill in. What I did was run a query on that condition, table.where(value<number), then make it an array by using take(num rows). Then, since all of these rows contain the information I need on which random rows are needed from the large Hive table, I iterate through the array.

When I do the query I use ORDER BY RAND() in the query (using sqlContext). I created a var Hive table (to be mutable), adding a column from the larger table. In the loop, I do a unionAll: newHiveTable = newHiveTable.unionAll(random_rows).
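
Roughly sketched, the loop looks like this (table and column names are made up here; Spark 1.3 APIs):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc) // sc is the existing SparkContext

    // rows from the smaller table that satisfy the condition
    val driverRows = hiveContext.table("small_table").where(col("value") < 100).take(50)

    // seed a mutable DataFrame with the column(s) needed from the larger table
    var newHiveTable: DataFrame = hiveContext.sql("SELECT key, value FROM big_table LIMIT 0")

    for (row <- driverRows) {
      val key = row.getString(0) // the variable needed to pick the random rows
      val randomRows = hiveContext.sql(
        s"SELECT key, value FROM big_table WHERE key = '$key' ORDER BY RAND() LIMIT 10")
      newHiveTable = newHiveTable.unionAll(randomRows)
    }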

I have tried many different ways to do this, but am not sure what the best way is to avoid excessive CPU and temp disk use. I know that DataFrames aren't intended for incremental adds. One thing I have thought of trying now is to create a csv file, write the random rows to that file incrementally in the loop, and then, when the loop is finished, load the csv file as a table and do one unionAll to get my final table.

Any feedback would be great. Thanks


Solution

  • I would recommend that you create an external table in Hive, defining the location, and then let Spark write the output as CSV to that directory:

    in Hive:

    create external table test(key string, value string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ';'
    LOCATION '/SOME/HDFS/LOCATION'
    

    And then from Spark, with the aid of https://github.com/databricks/spark-csv , write the DataFrame out as CSV files, appending to the existing ones (on Spark 1.3 the DataFrameWriter API, df.write, is not available yet, so the older save call is used):

    import org.apache.spark.sql.SaveMode
    df.save("/SOME/HDFS/LOCATION/", "com.databricks.spark.csv", SaveMode.Append)
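
    Putting that together with the loop from the question, a minimal sketch might look like the following (table, column, and path names are placeholders; the delimiter option is set to ';' to match the external table definition above):

    import org.apache.spark.sql.SaveMode
    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc) // sc is the existing SparkContext

    // rows from the smaller table that decide which random rows are needed (placeholder names)
    val driverRows = hiveContext.table("small_table").filter("value < 100").take(50)

    for (row <- driverRows) {
      val key = row.getString(0)
      val randomRows = hiveContext.sql(
        s"SELECT key, value FROM big_table WHERE key = '$key' ORDER BY RAND() LIMIT 10")

      // append each batch directly to the external table's directory instead of building
      // one big DataFrame with unionAll on the driver
      randomRows.save("com.databricks.spark.csv", SaveMode.Append,
        Map("path" -> "/SOME/HDFS/LOCATION/", "delimiter" -> ";"))
    }

    Once the files land in that directory, the rows are visible through the external table in Hive, so no final unionAll or table rewrite is needed.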