Spark decodes an ISO-8859-1 file correctly when read as CSV but not when read as text

Problem: When I read an ISO-8859-1 encoded file as text, the accented characters are not decoded correctly. The same file read as CSV, with the same encoding option, comes out fine.

OS: Ubuntu 23.10

Scala: 2.13.12

Spark: 3.5.0

Code:

package sct

import org.apache.spark.sql.{DataFrame, DataFrameReader, Dataset, SparkSession}

object EncodingApp {
  def main(args: Array[String]): Unit = {
    val inFile: String = "ISO_8859_1.txt" // iso-8859-1 encoded file with only one line: "José, André"
    val spark: SparkSession = SparkSession.builder.appName("Encoding Application")
      .master("local[*]").getOrCreate()
    val reader: DataFrameReader = spark.read.option("encoding", "ISO-8859-1")

    val text: Dataset[String] = reader.textFile(inFile)
    val csv: DataFrame = reader.csv(inFile)

    text.show()
    csv.show()

    spark.stop() // close() is a synonym for stop(), so one call is enough
  }
}

Output:

+-----------+
|      value|
+-----------+
|Jos�, Andr�|
+-----------+

+----+------+
| _c0|   _c1|
+----+------+
|José| André|
+----+------+

What am I doing wrong?

Solution

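The difference comes from how the two data sources treat the encoding option. The CSV reader supports it and decodes the file as ISO-8859-1. The text reader (text / textFile) does not list encoding among its documented options in Spark 3.5, so the option appears to be silently ignored and the bytes of each line are interpreted as UTF-8. In ISO-8859-1, é is the single byte 0xE9, which is not a valid UTF-8 sequence on its own, so it is displayed as the replacement character �.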
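The garbling can be reproduced without Spark. The following is a minimal sketch in plain Scala, assuming the file contains exactly the bytes of "José, André" in ISO-8859-1; the object name EncodingMismatchDemo is made up for illustration.

```scala
import java.nio.charset.StandardCharsets

object EncodingMismatchDemo {
  // "José, André", written with escapes so the source file encoding does not matter
  val original: String = "Jos\u00e9, Andr\u00e9"

  // The bytes as they would sit in an ISO-8859-1 file (é is the single byte 0xE9)
  val isoBytes: Array[Byte] = original.getBytes(StandardCharsets.ISO_8859_1)

  // Decoding those bytes as UTF-8 replaces each invalid 0xE9 with U+FFFD
  val decodedAsUtf8: String = new String(isoBytes, StandardCharsets.UTF_8)

  // Decoding them with the charset they were written in restores the text
  val decodedAsIso: String = new String(isoBytes, StandardCharsets.ISO_8859_1)

  def main(args: Array[String]): Unit = {
    println(decodedAsUtf8)
    println(decodedAsIso)
  }
}
```

The first string matches what text.show() displays in the question, the second what csv.show() displays.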

I would approach this as follows. First, import the required libraries:

import spark.implicits._ // requires a SparkSession value named spark in scope
import org.apache.spark.sql.functions._

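Then read the file as binary, so that the correct encoding can be applied when the bytes are converted to a string. Note that binaryFile loads each file's full content into a single row, so this suits files that fit comfortably in memory: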

val text = spark.read
  .format("binaryFile")
  .load("ISO_8859_1.txt")
  .select(col("content"))
  .as[Array[Byte]]
  // Decode the raw bytes as ISO-8859-1, then emit one row per line
  .flatMap(bytes => new String(bytes, "ISO-8859-1").split("\n"))
  .toDF("value")

text.show()

The text is now decoded correctly, but the data is still in a single column:

+-----------+
|      value|
+-----------+
|José, André|
+-----------+

To get the two-column DataFrame you want, split the value on the comma and trim each part:

val csvData = text
  .withColumn("_tmp", split(col("value"), ","))
  .select(
    trim(col("_tmp").getItem(0)).as("_c0"),
    trim(col("_tmp").getItem(1)).as("_c1")
  )

csvData.show()

Result:

+----+-----+
| _c0|  _c1|
+----+-----+
|José|André|
+----+-----+

Note: you will probably want to wrap this in a function and adapt it to your needs, but I think it's a good starting point.
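One possible shape for such a function is sketched below. It swaps in a different technique from the binaryFile approach above: it reads the file line by line through Hadoop's TextInputFormat and decodes each line's raw bytes with the given charset, so the whole file does not have to fit in one row. The object and method names (CharsetTextReader, readTextWithCharset) are hypothetical and not part of the Spark API. It assumes a charset in which a newline is the single byte 0x0A, which holds for ISO-8859-1 but not for UTF-16, for example.

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat
import org.apache.spark.sql.{Dataset, SparkSession}

object CharsetTextReader {
  // Hypothetical helper: returns one row per line, decoded with charsetName.
  def readTextWithCharset(
      spark: SparkSession,
      path: String,
      charsetName: String): Dataset[String] = {
    import spark.implicits._

    val lines = spark.sparkContext
      // Each record is (byte offset, raw line bytes wrapped in a Hadoop Text)
      .hadoopFile[LongWritable, Text, TextInputFormat](path)
      // Text.getBytes may return a buffer longer than the line, so decode
      // only the first getLength bytes, and copy into a String right away
      // because Hadoop reuses the Text object between records
      .map { case (_, line) =>
        new String(line.getBytes, 0, line.getLength, charsetName)
      }

    spark.createDataset(lines)
  }
}
```

It would be called as CharsetTextReader.readTextWithCharset(spark, "ISO_8859_1.txt", "ISO-8859-1"), and the resulting Dataset[String] has a single column named value, so the split and trim step above can be applied to it unchanged.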