
Is there a size limit for Spark's RDD


Does a Spark RDD have a size limit?

As for my specific case, can an RDD have 2^400 columns?


Solution

  • Theoretically, an RDD has no size limit, nor any limit on the number of columns it can store. However, Spark does impose one limitation: each RDD partition is capped at 2 GB. See here

    So, you can store 2^400 columns in an RDD, as long as each partition stays under 2 GB.

    That said, there are practical problems with 2^400 columns. To stay within the current Spark partition limit, such a huge number of columns would force you to repartition the data into a very large number of partitions, which will likely reduce efficiency.
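    As a rough illustration of the sizing involved, here is a minimal sketch (plain Python, not a Spark API; the helper name and the even-distribution assumption are mine) that estimates the minimum number of partitions needed to keep every partition under the 2 GB cap:

    ```python
    import math

    # Spark's per-partition cap discussed above: 2 GB
    PARTITION_CAP_BYTES = 2 * 1024**3

    def min_partitions(total_bytes: int, cap: int = PARTITION_CAP_BYTES) -> int:
        """Smallest partition count such that no partition exceeds `cap` bytes,
        assuming the data is spread evenly across partitions."""
        return max(1, math.ceil(total_bytes / cap))

    # Example: 10 TB of row data needs at least 5120 partitions
    print(min_partitions(10 * 1024**4))  # -> 5120
    ```

    In PySpark you would then apply such an estimate with `rdd.repartition(n)`; the point is that as row width grows toward something like 2^400 columns, the required partition count grows with it, and so does the shuffle overhead.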