I'm using IPython in a Spark/Bluemix environment.
I have a CSV file uploaded to the object store and I can read it fine using sc.textFile, but I get a "file does not exist" error when I use pandas pd.read_csv:
data = sc.textFile("swift://notebooks.books/rtenews.csv")
import pandas as pd
data = pd.read_csv('swift://notebooks.books/rtenews.csv')
IOError: File swift://notebooks.books/rtenews.csv does not exist
Why is this? How can I read a CSV file into a pandas DataFrame?
Once you have uploaded the CSV file to your Bluemix Object Storage, you can read the CSV file using Spark directly:
data = sc.textFile("swift://notebooks.books/rtenews.csv")
This works because the Spark environment has already been configured to access the Object Storage.
If you try to read the CSV file with pandas using the following code:
import pandas as pd
data = pd.read_csv('swift://notebooks.books/rtenews.csv')
This will not work, because pandas does not support direct access to Bluemix Object Storage. Have a look at the API documentation of pandas.read_csv(): http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
It supports only a few valid URL schemes, and swift:// is not one of them.
However, it is possible to read a CSV file from your Bluemix Object Storage into a StringIO object and load that into a pandas.DataFrame.
You can find the instructions in the "Precipitation Analysis" sample notebook.
Do not use this approach for large CSV files, because the whole file is held in memory!
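
The StringIO step can be sketched roughly as follows. This is only a minimal illustration, not the code from the sample notebook: the helper name csv_text_to_dataframe and the sample data are made up, Python 3 is assumed, and how the CSV text is downloaded from Object Storage is deliberately left out.

```python
from io import StringIO

import pandas as pd


def csv_text_to_dataframe(csv_text):
    """Parse CSV content that is already in memory into a pandas DataFrame.

    Hypothetical helper for illustration; not part of pandas or Bluemix.
    """
    # StringIO wraps the string in a file-like object, which read_csv accepts
    return pd.read_csv(StringIO(csv_text))


# Stand-in for the text fetched from Object Storage (made-up sample data)
sample_text = "title,views\nfirst story,10\nsecond story,25\n"
df = csv_text_to_dataframe(sample_text)
print(df)
```

Since sc.textFile already works in the notebook, one way to get the text might be "\n".join(sc.textFile("swift://notebooks.books/rtenews.csv").collect()), which pulls every line to the driver and so is only sensible for small files.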