I have read a dozen pages of docs, and it seems that:
1. I can skip learning the Scala part.
2. The API is completely implemented in Python (I don't need to learn Scala for anything).
3. The interactive mode works as completely and as quickly as the Scala shell, and troubleshooting is equally easy.
4. Python modules like NumPy can still be imported (no crippled Python environment).
Are there areas where PySpark falls short that would make this impossible?
In recent Spark releases (1.0+), we've implemented all of the missing PySpark features listed below. A few new features are still missing, such as Python bindings for GraphX, but the other APIs have achieved near parity (including an experimental Python API for Spark Streaming).
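For instance, here's a minimal sketch of that experimental streaming API (the usual network word-count example; it assumes a Spark 1.2+ build and some process writing lines of text to localhost:9999):

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    # Two local threads: one receives the stream, the other processes it.
    sc = SparkContext("local[2]", "PySparkStreamingSketch")
    ssc = StreamingContext(sc, batchDuration=1)  # 1-second micro-batches

    # Count words arriving on a local TCP socket (the port is arbitrary).
    lines = ssc.socketTextStream("localhost", 9999)
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()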
My earlier answers are reproduced below:
A lot has changed in the seven months since my original answer (reproduced at the bottom of this answer):
As of Spark 0.9, the main missing features in PySpark are:
- SequenceFile support (there's an open pull request for this); a rough sketch of that API appears after the note on performance below.

Although we've made many performance improvements, there's still a performance gap between Spark's Scala and Python APIs. The Spark users mailing list has an open thread discussing its current performance.
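To make the SequenceFile gap concrete, this is roughly what reading one looks like once that pull request is merged (the sequenceFile method, path, and Writable class names here are illustrative, not something available in 0.9):

    # Sketch: read a Hadoop SequenceFile of (Text, IntWritable) pairs as an
    # RDD of (str, int) tuples. Only works in releases that include this API.
    rdd = sc.sequenceFile("hdfs:///path/to/data",
                          keyClass="org.apache.hadoop.io.Text",
                          valueClass="org.apache.hadoop.io.IntWritable")
    print(rdd.take(5))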
If you discover any missing features in PySpark, please open a new ticket on our JIRA issue tracker.
The Spark Python Programming Guide has a list of missing PySpark features. As of Spark 0.7.2, PySpark is currently missing support for sample(), sort(), and persistence at different StorageLevels. It's also missing a few convenience methods that were added to the Scala API.
The Java API was in sync with the Scala API when it was released, but many new RDD methods have been added since then and not all of them have been added to the Java wrapper classes. There's a discussion about how to keep the Java API up-to-date at https://groups.google.com/d/msg/spark-developers/TMGvtxYN9Mo/UeFpD17VeAIJ. In that thread, I suggested a technique for automatically finding missing features, so it's just a matter of someone taking the time to add them and submit a pull request.
Regarding performance, PySpark is going to be slower than Scala Spark. Part of the performance difference stems from a weird JVM issue when forking processes with large heaps, but there's an open pull request that should fix that. The other bottleneck comes from serialization: right now, PySpark doesn't require users to explicitly register serializers for their objects (we currently use binary cPickle plus some batching optimizations). In the past, I've looked into adding support for user-customizable serializers that would allow you to specify the types of your objects and thereby use specialized serializers that are faster; I hope to resume work on this at some point.
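To give a sense of what user-customizable serializers would look like, later PySpark releases expose a serializer argument on SparkContext; this sketch assumes such a release (it is not available in 0.7.2):

    from pyspark import SparkContext
    from pyspark.serializers import MarshalSerializer

    # marshal is faster than pickle but only supports simple built-in types,
    # so use it only when RDD elements are ints, strings, lists, and the like.
    sc = SparkContext("local", "SerializerSketch",
                      serializer=MarshalSerializer())

    print(sc.parallelize(range(1000)).map(lambda x: 2 * x).take(10))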
PySpark is implemented using a regular CPython interpreter, so libraries like NumPy should work fine (this wouldn't have been the case if PySpark were written in Jython).
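For example, functions shipped to the workers can call into NumPy directly; this minimal sketch assumes NumPy is installed on the driver and on every worker node:

    import numpy as np
    from pyspark import SparkContext

    sc = SparkContext("local", "NumPySketch")

    # Each partition is converted to a NumPy array and summed on the worker;
    # only the small per-partition sums travel back to the driver.
    data = sc.parallelize(range(100000), numSlices=4)
    partial_sums = data.mapPartitions(lambda it: [np.array(list(it)).sum()])
    print(sum(partial_sums.collect()))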
It's pretty easy to get started with PySpark: simply downloading a pre-built Spark package and running the pyspark interpreter should be enough to test it out on your personal computer and will let you evaluate its interactive features. If you like IPython, you can run IPYTHON=1 ./pyspark in your shell to launch PySpark with an IPython shell.
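Once the shell is up, a quick sanity check looks something like this (the shell creates the SparkContext for you as sc; README.md is just whatever small text file you have lying around):

    >>> lines = sc.textFile("README.md")
    >>> lines.count()                                   # total number of lines
    >>> lines.filter(lambda l: "Spark" in l).count()    # lines mentioning "Spark"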