Tags: python, apache-spark, distributed-computing, partitioning, rdd

RDD basics about partitions


I am reading Spark: RDD operations and executing the following:

In [7]: lines = sc.textFile("data")

In [8]: lines.getNumPartitions()
Out[8]: 1000

In [9]: lineLengths = lines.map(lambda s: len(s))

In [10]: lineLengths.getNumPartitions()
Out[10]: 1000

In [11]: len(lineLengths.collect())
Out[11]: 508524

I would expect my dataset to be split into parts. How many? As many as the number of partitions, i.e. 1000.

Then map() would run on every partition and return a local result (which should then be reduced). If that were the case, I would expect lineLengths, which is a list of numbers, to have a length equal to the number of partitions, but it does not.

What am I missing?


Solution

  • len(lineLengths.collect()) or lineLengths.count() gives the number of rows in your RDD, while lineLengths.getNumPartitions(), as you noted, is the number of partitions your RDD is distributed over. Each partition contains many rows of the RDD. What you are missing is that map() is applied to every element, not to every partition: each partition does compute the lengths of its own lines locally, but the result is still one number per line, so collect() returns 508524 values regardless of partitioning. If you want one local result per partition, use mapPartitions() instead, as sketched below.
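
    Here is a minimal sketch of the distinction, assuming a running SparkContext sc and the same "data" file as in the question:

    # map(): one output element per input element, so the element
    # count is unchanged (508524 here), whatever the partitioning.
    lines = sc.textFile("data")
    lineLengths = lines.map(lambda s: len(s))

    # mapPartitions(): the function receives an iterator over one
    # whole partition and yields that partition's local result, so
    # collecting gives one value per partition -- the behavior the
    # question expected from map().
    perPartitionTotals = lines.mapPartitions(lambda it: [sum(len(s) for s in it)])
    len(perPartitionTotals.collect())  # == lines.getNumPartitions()

    # The per-element (or per-partition) results are then combined
    # with an action such as reduce(), e.g. the total number of
    # characters in the dataset:
    total = lineLengths.reduce(lambda a, b: a + b)

    The per-partition aggregation still happens under the hood with plain map() plus reduce(): each partition reduces its own values locally before the partial results are combined on the driver. You just never see those intermediate per-partition values in the collected output.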