csv, apache-spark, pyspark

Custom delimiter for the Spark CSV reader


I would like to read a file with the following structure using Apache Spark:

628344092\t20070220\t200702\t2007\t2007.1370

The delimiter is \t. How can I specify this when using spark.read.csv()?

The CSV is far too big for pandas, which takes ages to read it. Is there an option that works like

pandas.read_csv(file, sep = '\t')

Thanks a lot!


Solution

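  • Use spark.read.option("delimiter", "\t").csv(file). The option can also be spelled sep instead of delimiter.

    If the file contains the literal two characters \ and t rather than a tab character, escape the backslash: spark.read.option("delimiter", "\\t").csv(file). Spark applies its own escape handling to the delimiter value, so check the parsed columns on your Spark version.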