Tags: csv, apache-spark, pyspark, apache-spark-mllib, google-cloud-dataproc

How to import csv files with massive column count into Apache Spark 2.0


I'm running into a problem importing multiple small CSV files with over 250,000 columns of float64 into Apache Spark 2.0 running on a Google Cloud Dataproc cluster. There are a handful of string columns, but I'm only really interested in one of them, the class label.

When I run the following in pyspark:

    csvdata = spark.read.csv("gs://[bucket]/csv/*.csv", header=True, mode="DROPMALFORMED")

I get the following error:

    File "/usr/lib/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value
    py4j.protocol.Py4JJavaError: An error occurred while calling o53.csv.
    : com.univocity.parsers.common.TextParsingException: java.lang.ArrayIndexOutOfBoundsException - 20480
    Hint: Number of columns processed may have exceeded limit of 20480 columns.
    Use settings.setMaxColumns(int) to define the maximum number of columns your input can have
    Ensure your configuration is correct, with delimiters, quotes and escape sequences that match the input format you are trying to parse
    Parser Configuration: CsvParserSettings:

  1. Where/how do I set the maximum number of columns for the parser so I can use the data for machine learning?
  2. Is there a better way to ingest the data for use with Apache Spark MLlib?

This question points to defining a class for the dataframe to use, but would it be possible to define such a large class without having to create 210,000 entries?


Solution

  • Use the maxColumns option:

    spark.read.option("maxColumns", n).csv(...)
    

    where n is at least the number of columns in your input files (see the fuller sketch below).
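
A fuller sketch, not taken from the original answer: it assumes the class label is the first column and is literally named label, uses f0..fN as stand-ins for the real feature column names, and picks 300000 as a maxColumns value comfortably above the roughly 250,000 columns described in the question.

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType
    from pyspark.ml.feature import VectorAssembler

    # In the pyspark shell a SparkSession named spark already exists.
    spark = SparkSession.builder.appName("wide-csv").getOrCreate()

    # Build the ~250,000-entry schema in a loop rather than writing it by hand.
    # Reorder the fields to match the real header, or drop .schema(...) below
    # and use .option("inferSchema", "true") instead.
    feature_names = ["f{}".format(i) for i in range(250000)]
    schema = StructType(
        [StructField("label", StringType(), True)] +
        [StructField(name, DoubleType(), True) for name in feature_names]
    )

    csvdata = (spark.read
               .option("header", "true")
               .option("mode", "DROPMALFORMED")
               .option("maxColumns", 300000)   # must be at least the widest file
               .schema(schema)                 # skips a costly schema-inference pass
               .csv("gs://[bucket]/csv/*.csv"))

    # MLlib estimators expect a single vector column, so assemble the floats.
    assembler = VectorAssembler(inputCols=feature_names, outputCol="features")
    training = assembler.transform(csvdata).select("label", "features")

Building the StructType in a loop avoids the 210,000 hand-written entries the question worries about, and VectorAssembler is one common way to get the frame into the label/features shape MLlib expects; a StringIndexer would still be needed to turn a string label into a numeric one for most classifiers.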