Tags: apache-spark, ibm-cloud, pyspark

spark-csv or pyspark-csv in Spark environment (IBM Bluemix)


I need to load a number of large CSV files in Spark on Bluemix.

I can do it via sc.textFile and then map it, but that requires repetitive and cumbersome code.
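For reference, the kind of code I mean looks roughly like this (a sketch with a placeholder path and made-up columns; the naive split also breaks on quoted fields):

    from pyspark.sql import Row, SQLContext

    sqlContext = SQLContext(sc)  # sc is preconfigured in the notebook

    raw = sc.textFile("accidents.csv")  # placeholder path
    header = raw.first()
    rows = (raw.filter(lambda line: line != header)        # drop the header line
               .map(lambda line: line.split(","))          # naive split, no quoting support
               .map(lambda f: Row(date=f[0], borough=f[1], injured=int(f[2]))))
    df = sqlContext.createDataFrame(rows)

And this has to be repeated, with a different column list and casts, for every file.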

Is there a way to add/load either the Databricks spark-csv package or pyspark-csv to the environment (tried but it didn't like it)?

I saw an example of doing it via pandas, but since some of the files could be very large (tens of GBs), that didn't sound like a good idea. This is in Python, but I could switch to Scala.


Solution

  • In a Python notebook, you can use

    sc.addPyFile("https://raw.githubusercontent.com/seahboonsiew/pysparkcsv/master/pyspark_csv.py")
    

    to add pyspark-csv to your runtime environment. Have a look at the "NY Motor Vehicle Accidents Analysis" sample notebook, in which we added pyspark-csv.
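    Once the file has been shipped with sc.addPyFile, a minimal usage sketch (assuming pyspark-csv's csvToDataFrame entry point and a placeholder path) looks like this:

    import pyspark_csv as pycsv
    from pyspark.sql import SQLContext

    sqlContext = SQLContext(sc)
    plaintext = sc.textFile("accidents.csv")  # placeholder path
    # csvToDataFrame handles quoted fields and infers column types
    df = pycsv.csvToDataFrame(sqlContext, plaintext, sep=",")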

    In a Scala notebook, you can use

    %AddDeps com.databricks spark-csv_2.10 1.3.0 --transitive
    

    to add spark-csv. Of course, you can choose a different version.
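    Once the package is on the classpath, reading a CSV becomes a one-liner through the DataFrame reader. A sketch in PySpark syntax (the Scala call chain is analogous; the path is a placeholder):

    df = (sqlContext.read
              .format("com.databricks.spark.csv")
              .options(header="true", inferSchema="true")
              .load("accidents.csv"))  # placeholder path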

    What do you mean by "tried but it didn't like it"?

    Loading large amounts of data into a pandas.DataFrame is not a good idea; you are right about that.