Tags: apache-spark, ibm-cloud, pyspark

spark-csv or pyspark-csv in Spark environment (IBM Bluemix)


I need to load a number of large CSV files in Spark on Bluemix.

I can do it via sc.textFile and then map it, but that requires repetitive and cumbersome code.
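For reference, the kind of code I mean looks roughly like this (a sketch with a placeholder path and made-up columns; the naive split also breaks on quoted fields):

    from pyspark.sql import Row, SQLContext

    sqlContext = SQLContext(sc)  # sc is preconfigured in the notebook

    raw = sc.textFile("accidents.csv")  # placeholder path
    header = raw.first()
    rows = (raw.filter(lambda line: line != header)        # drop the header line
               .map(lambda line: line.split(","))          # naive split, no quoting support
               .map(lambda f: Row(date=f[0], borough=f[1], injured=int(f[2]))))
    df = sqlContext.createDataFrame(rows)

And this has to be repeated, with a different column list and casts, for every file.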

Is there a way to add/load either the Databricks spark-csv package or pyspark-csv to the environment (tried but it didn't like it)?

I saw an example of doing it via pandas, but since some of the files could be very large (tens of GBs), that didn't sound like a good idea. This is in Python, but I could switch to Scala.


Solution

  • In a Python notebook, you can use

    sc.addPyFile("https://raw.githubusercontent.com/seahboonsiew/pysparkcsv/master/pyspark_csv.py")
    

    to add pyspark-csv to your runtime environment. Have a look at the "NY Motor Vehicle Accidents Analysis" sample notebook, in which we added pyspark-csv.
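    Once the file has been shipped with sc.addPyFile, a minimal usage sketch (assuming pyspark-csv's csvToDataFrame entry point and a placeholder path) looks like this:

    import pyspark_csv as pycsv
    from pyspark.sql import SQLContext

    sqlContext = SQLContext(sc)
    plaintext = sc.textFile("accidents.csv")  # placeholder path
    # csvToDataFrame handles quoted fields and infers column types
    df = pycsv.csvToDataFrame(sqlContext, plaintext, sep=",")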

    In a Scala notebook, you can use

    %AddDeps com.databricks spark-csv_2.10 1.3.0 --transitive
    

    to add spark-csv. Of course, you can choose a different version.
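    Once the package is on the classpath, reading a CSV becomes a one-liner through the DataFrame reader. A sketch in PySpark syntax (the Scala call chain is analogous; the path is a placeholder):

    df = (sqlContext.read
              .format("com.databricks.spark.csv")
              .options(header="true", inferSchema="true")
              .load("accidents.csv"))  # placeholder path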

    What do you mean by "tried but it didn't like it"?

    Loading large amounts of data into a pandas.DataFrame is not a good idea; you are right about that.