Tags: apache-spark, spark-csv

Why does specifying the schema as DateType / TimestampType make querying extremely slow?


I'm using spark-csv 1.1.0 with Spark 1.5, and I build the schema as follows:

import org.apache.spark.sql.types._

// Map each configured column to the corresponding Spark SQL StructField.
private def makeSchema(tableColumns: List[SparkSQLFieldConfig]): StructType = {
    new StructType(
      tableColumns.map(p => p.ColumnDataType match {
        case FieldDataType.Integer  => StructField(p.ColumnName, IntegerType, nullable = true)
        case FieldDataType.Decimal  => StructField(p.ColumnName, FloatType, nullable = true)
        case FieldDataType.String   => StructField(p.ColumnName, StringType, nullable = true)
        case FieldDataType.DateTime => StructField(p.ColumnName, TimestampType, nullable = true)
        case FieldDataType.Date     => StructField(p.ColumnName, DateType, nullable = true)
        case FieldDataType.Boolean  => StructField(p.ColumnName, BooleanType, nullable = false)
        case _                      => StructField(p.ColumnName, StringType, nullable = true)
      }).toArray
    )
  }

But when the schema contains DateType columns, my DataFrame queries become very slow, even though they are just simple groupBy() and sum() aggregations.

With the same dataset, after I commented out the two lines that map Date to DateType and DateTime to TimestampType (so that both fall through to StringType), the queries became much faster.

What is the possible reason for this? Thank you very much!


Solution

  • We have found a possible answer to this problem.

    When a column is simply declared as DateType or TimestampType, spark-csv tries to parse each value against all of its internal date formats, for every line of every row, which makes the parsing process much slower.

    According to its official documentation, the expected date format can be supplied as an option when reading the file. Specifying it should make parsing much faster, since only that single pattern has to be tried.
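As a sketch of the fix (assuming the Spark 1.5 / spark-csv 1.x DataFrameReader API; the schema, column names, pattern, and file path here are hypothetical placeholders):

    import org.apache.spark.sql.types._

    // Placeholder schema: one integer column and one timestamp column.
    val schema = StructType(Array(
      StructField("id", IntegerType, nullable = true),
      StructField("created_at", TimestampType, nullable = true)
    ))

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .schema(schema)
      .option("header", "true")
      // Give spark-csv the exact pattern (SimpleDateFormat syntax) so it
      // does not have to try every internal format on every row.
      .option("dateFormat", "yyyy-MM-dd HH:mm:ss")
      .load("data.csv")

The option name and pattern syntax should be checked against the spark-csv documentation for the exact version in use; the key point is that a single explicit format replaces the per-row trial of all internal formats.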