Search code examples
apache-sparkapache-spark-sqlapache-spark-2.0

Apache Spark Dataframe - Load data from nth line of a CSV file


I would like to process a huge order CSV file (5GB), with some metadata rows at the start of file. Header columns are represented in row 4 (starting with "h,") followed by another metadata row, describing optionality. Data rows start with "d,"

m,Version,v1.0
m,Type,xx
m,<OtherMetaData>,<...>
h,Col1,Col2,Col3,Col4,Col5,.............,Col100
m,Mandatory,Optional,Optional,...........,Mandatory
d,Val1,Val2,Val3,Val4,Val5,.............,Val100

Is it possible to skip a specified number of rows when loading the file and use 'inferSchema' option for DataSet?

Dataset<Row> df = spark.read()
            .format("csv")
            .option("header", "true")
            .option("inferSchema", "true")
            .load("\home\user\data\20170326.csv");

Or do I need to define two different Datasets and use "except(Dataset other)" to exclude the dataset with rows to be ignored?


Solution

  • You can try setting the "comment" option to "m", effectively telling the csv reader to skip lines beginning with the "m" character.

    df = spark.read()
              .format("csv")
              .option("header", "true")
              .option("inferSchema", "true")
              .option("comment", "m")
              .load("\home\user\data\20170326.csv")