Tags: hadoop, apache-spark, dataframe, hbase, hadoop2

Apache Spark: Which data storage and data format to choose


I'm going to write a sales analytics application with Spark. Every night I receive a delta dataset with new sales data (the sales of the previous day). Later I want to run some analytics such as association rules or product popularity.

The sales data contains information about:

  • store-id
  • article-group
  • timestamp of cash-point
  • article GTIN
  • amount
  • price

So far I have used the simple .textFile method and RDDs in my applications. I have heard about DataFrames and Parquet, which is a table-like data format for text files, right? And what about storing the data once in a database (I have HBase installed in a Hadoop cluster) and reading it from there later?
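
For reference, this is roughly how I load the data today (a minimal sketch; the file path, delimiter and field order are just placeholders, not my real setup):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Plain RDD approach: parse each CSV line by hand into a case class.
case class Sale(storeId: String, articleGroup: String, cashPointTs: String,
                gtin: String, amount: Int, price: Double)

object LoadSalesRdd {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("load-sales-rdd"))

    val sales = sc.textFile("hdfs:///sales/delta/*.csv")
      .map(_.split(";"))
      .map(f => Sale(f(0), f(1), f(2), f(3), f(4).toInt, f(5).toDouble))

    println(s"loaded ${sales.count()} sales records")
    sc.stop()
  }
}
```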

Can someone give a short overview of the different save/load options in Spark, and a recommendation on what to use for this data?

The data volume is currently about 6 GB, which represents data from 3 stores over about 1 year. Later I will work with data from ~500 stores over a period of ~5 years.


Solution

  • You can use Spark to process that data without any problem. You can also read from a CSV file (there's a library from Databricks that supports CSV). You can manipulate it, and from an RDD you're one step closer to turning it into a DataFrame. And you can write the final DataFrame directly into HBase; see the sketches below. All the documentation you need is here: http://spark.apache.org/docs/latest/sql-programming-guide.html https://www.mapr.com/blog/spark-streaming-hbase
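
    A minimal sketch of the CSV-to-DataFrame path described above, assuming Spark 1.x with the Databricks spark-csv package on the classpath (e.g. started with --packages com.databricks:spark-csv_2.10:1.5.0); the input path, options and partition column are placeholders, not from the question:

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

object CsvToParquet {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sales-csv-to-parquet"))
    val sqlContext = new SQLContext(sc)

    // Read the nightly delta file as a DataFrame via the spark-csv data source.
    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("hdfs:///sales/delta/2016-01-01.csv")

    // Append the delta to a Parquet table; partitioning by store keeps
    // later per-store scans cheap as the data grows.
    df.write
      .mode("append")
      .partitionBy("store_id")
      .parquet("hdfs:///sales/parquet")

    sc.stop()
  }
}
```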
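
    And a rough sketch of writing the rows into HBase through the standard Hadoop OutputFormat API (the MapR blog linked above uses the same pattern); the table name, column family, row-key layout and HBase 1.x client API are assumptions:

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.{SparkConf, SparkContext}

object SalesToHBase {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sales-to-hbase"))

    // Point the Hadoop OutputFormat at an existing HBase table named "sales".
    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableOutputFormat.OUTPUT_TABLE, "sales")
    val job = Job.getInstance(hbaseConf)
    job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])

    // One input line per sale: storeId;articleGroup;timestamp;gtin;amount;price
    val puts = sc.textFile("hdfs:///sales/delta/2016-01-01.csv")
      .map(_.split(";"))
      .map { f =>
        // Row key: storeId_timestamp (assumption; choose a key that fits your queries).
        val put = new Put(Bytes.toBytes(f(0) + "_" + f(2)))
        put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("gtin"),   Bytes.toBytes(f(3)))
        put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("amount"), Bytes.toBytes(f(4)))
        put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("price"),  Bytes.toBytes(f(5)))
        (new ImmutableBytesWritable, put)
      }

    puts.saveAsNewAPIHadoopDataset(job.getConfiguration)
    sc.stop()
  }
}
```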

    Cheers, Alex