apache-spark apache-spark-sql apache-spark-2.0

Apache Spark Dataframe - Load data from nth line of a CSV file

I would like to process a huge order CSV file (5GB), with some metadata rows at the start of file. Header columns are represented in row 4 (starting with "h,") followed by another metadata row, describing optionality. Data rows start with "d,"

m,Version,v1.0
m,Type,xx
m,<OtherMetaData>,<...>
h,Col1,Col2,Col3,Col4,Col5,.............,Col100
m,Mandatory,Optional,Optional,...........,Mandatory
d,Val1,Val2,Val3,Val4,Val5,.............,Val100

Is it possible to skip a specified number of rows when loading the file and use 'inferSchema' option for DataSet?

Dataset<Row> df = spark.read()
            .format("csv")
            .option("header", "true")
            .option("inferSchema", "true")
            .load("\home\user\data\20170326.csv");

Or do I need to define two different Datasets and use "except(Dataset other)" to exclude the dataset with rows to be ignored?

Solution

You can try setting the "comment" option to "m", effectively telling the csv reader to skip lines beginning with the "m" character.

df = spark.read()
          .format("csv")
          .option("header", "true")
          .option("inferSchema", "true")
          .option("comment", "m")
          .load("\home\user\data\20170326.csv")