Tags: python, csv, pandas, apache-spark, ibm-cloud

pandas.read_csv in Spark environment (IBM Bluemix)


I'm using IPython in a Spark/Bluemix environment.

I have a CSV file uploaded to the object store. I can read it fine using sc.textFile, but I get a "file does not exist" error when I use pandas pd.read_csv:

  1. data = sc.textFile("swift://notebooks.books/rtenews.csv")

  2. import pandas as pd
     data = pd.read_csv('swift://notebooks.books/rtenews.csv')

IOError: File swift://notebooks.books/rtenews.csv does not exist

Why is this? How can I read a CSV file into a pandas DataFrame?


Solution

  • Once you have uploaded the CSV file to your Bluemix Object Storage, you can read the CSV file using Spark directly:

    data = sc.textFile("swift://notebooks.books/rtenews.csv")
    

    This works because the Spark environment has already been configured to access Object Storage.

    If you try to read the CSV file with the following code using pandas:

    import pandas as pd 
    data = pd.read_csv('swift://notebooks.books/rtenews.csv')
    

    This will not work because pandas does not support direct access to Bluemix Object Storage. Have a look at the API documentation of pandas.read_csv(): http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html It accepts only a few valid URL schemes, and swift:// is not one of them.

    However, it is possible to read a CSV file in your Bluemix Object Storage into a pandas.DataFrame by going through a StringIO object.

    You can find the instructions in the "Precipitation Analysis" sample notebook:

    Do not use this approach for large CSV files, because the entire file content is held in memory!
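As a rough illustration of the StringIO idea, here is a minimal sketch. It is not the code from the sample notebook. It assumes you let Spark fetch the file (since sc.textFile already works in this environment), collect the lines to the driver, and hand them to pandas through an in-memory buffer. The helper name lines_to_dataframe and the stand-in column names are hypothetical, and the Spark call is left as a comment so the snippet runs without a cluster.

```python
import io

import pandas as pd


def lines_to_dataframe(lines):
    """Build a DataFrame from an iterable of CSV text lines (header first)."""
    # Join the lines back into one CSV document held in an in-memory buffer.
    buffer = io.StringIO("\n".join(lines))
    # pandas.read_csv accepts any file-like object, not only paths and URLs.
    return pd.read_csv(buffer)


# In the notebook, `sc` is the preconfigured SparkContext, so the lines
# could come from Object Storage like this (assumption, not run here):
# lines = sc.textFile("swift://notebooks.books/rtenews.csv").collect()

# Stand-in data so the sketch runs without Spark; the columns are made up.
lines = ["title,views", "Budget 2016,120", "Weather warning,45"]

df = lines_to_dataframe(lines)
print(df.shape)
```

Note that collect() pulls every line onto the driver, so the same warning applies: this is only reasonable for small files.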