Tags: apache-spark, apache-spark-sql, data-analysis, bigdata

Apache Spark (Scala) + Python/R workflow for data analysis


I'm wondering what people are doing for data analysis with this stack. I'm particularly interested in the Spark Scala API since it seems to have newer features and it's more "natural" to Spark.

However, I'm unsure what the best practices are for data visualization and exploration once the big data has been crunched and reduced.

For example, I run a Spark job over ~2 Bn records, and I end up with a Spark DataFrame of around 100k records containing results that I want to histogram, plot, and apply some ML to, in either Python or R.

What's the best way of achieving the handshake between these two worlds? Saving the results to a file (and if so, which format is best: Parquet, Avro, JSON, CSV)? Saving them to a database?
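For concreteness, the kind of handoff I have in mind is sketched below; the paths, column names, and format are just placeholders, not a settled choice:

```scala
// Spark (Scala) side: reduce ~2 Bn records down to a small result set,
// then persist it in a columnar format that Python/R can read directly.
// `spark` is the SparkSession provided by spark-shell / the application.
import org.apache.spark.sql.functions.count

val results = spark.read.parquet("/data/raw/events")   // hypothetical input path
  .groupBy("userId")
  .agg(count("*").as("eventCount"))

// Coalesce to a single file since the result is only ~100k rows.
results.coalesce(1)
  .write
  .mode("overwrite")
  .parquet("/data/results/event_counts")

// The Python side could then load this with pandas.read_parquet(...),
// and R with arrow::read_parquet(...).
```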

Basically, I'm wondering what other people find most comfortable when working with a similar stack.


Solution

  • Once the data has been transformed or crunched in Spark, you could consider the following options to visualize it:

    Apache Zeppelin for interactive data analytics (a minimal notebook sketch follows after this list).

    Another option is to store the Spark job output in Elasticsearch and use Kibana to visualize it (see the second sketch below).
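As a minimal Zeppelin sketch, assuming the Parquet output path from the question, the reduced results can be rendered directly with Zeppelin's built-in table/chart visualizations from a Scala paragraph:

```scala
// Zeppelin %spark paragraph (Scala): load the reduced results and
// render them with the notebook's built-in chart controls.
val results = spark.read.parquet("/data/results/event_counts")  // placeholder path

// `z` is the ZeppelinContext available inside Zeppelin's Spark interpreter;
// z.show renders a DataFrame with Zeppelin's table/chart UI.
z.show(results)

// Alternatively, register a temp view and query it from a %sql paragraph
// to get the same charting UI.
results.createOrReplaceTempView("event_counts")
```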
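For the Elasticsearch/Kibana route, here is a sketch of how the write might look with the elasticsearch-hadoop (elasticsearch-spark) connector; the node address, paths, and index name are assumptions:

```scala
// Requires the elasticsearch-spark connector on the classpath, e.g.
//   --packages org.elasticsearch:elasticsearch-spark-30_2.12:<version>
import org.apache.spark.sql.SparkSession
import org.elasticsearch.spark.sql._   // adds saveToEs to DataFrames

val spark = SparkSession.builder()
  .appName("results-to-es")
  .config("es.nodes", "localhost")   // assumed Elasticsearch host
  .config("es.port", "9200")
  .getOrCreate()

val results = spark.read.parquet("/data/results/event_counts")  // placeholder path

// Index the reduced results; "event_counts" is a hypothetical index name.
// Kibana can then be pointed at this index for dashboards and histograms.
results.saveToEs("event_counts")
```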