
Read TSV file in pyspark


What is the best way to read a .tsv file with a header in PySpark and store it in a Spark DataFrame?

I have tried the "spark.read.options" and "spark.read.csv" commands, but with no luck.

Thanks.

Regards, Jit


Solution

  • You can read the TSV file directly, without supplying an external schema, as long as a header row is available:

    df = spark.read.csv(path, sep='\t', header=True).select('col1', 'col2')
    

    Because Spark is lazily evaluated and applies column pruning, only the selected columns are actually read. Hope it helps.