
Trying to Read CSV Files in PySpark but it is also reading Text Files


I have a folder containing both .txt and .csv files, all with exactly the same column names.

However, when I try to read only the CSV files in PySpark with the code below, it reads both the text and CSV files and appends them together:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CSV Reader").getOrCreate()

csv_path = "path/to/csv/folder"

df = spark.read \
    .format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load(csv_path)

Solution

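  • When load() is pointed at a folder, Spark reads the files in it regardless of extension; format("csv") only selects the parser. You can use the pathGlobFilter option to define a pattern so that only .csv files are read: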

    spark.read.format("csv").option('pathGlobFilter', '*.csv').load(csv_path)
    

    Hope this helps. I found that option here: https://dbmstutorials.com/pyspark/spark-read-write-dataframe-options.html
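
For completeness, here is a rough sketch that combines the pathGlobFilter option with the header and inferSchema options from the question. The function names `read_csv_only` and `preview_matching_files` are hypothetical, made up for this example. The preview helper uses Python's fnmatch as a local approximation of the glob and does not call Spark, so treat its result as a sanity check rather than a guarantee of what Spark will pick up. It also assumes the folder is on the local filesystem.

```python
import fnmatch
import os

# The glob is matched against file names, not full paths.
CSV_PATTERN = "*.csv"


def preview_matching_files(folder, pattern=CSV_PATTERN):
    """Hypothetical helper: list file names in a local folder that match the glob.

    Uses fnmatch (case-sensitive) as an approximation; Spark is not involved.
    """
    return sorted(
        name
        for name in os.listdir(folder)
        if os.path.isfile(os.path.join(folder, name))
        and fnmatch.fnmatchcase(name, pattern)
    )


def read_csv_only(spark, folder, pattern=CSV_PATTERN):
    """Hypothetical wrapper: read only files matching `pattern` from `folder`.

    `spark` is an existing SparkSession, as created in the question.
    """
    return (
        spark.read
        .format("csv")
        .option("header", "true")
        .option("inferSchema", "true")
        .option("pathGlobFilter", pattern)  # needs Spark 3.0 or later
        .load(folder)
    )
```

With the names from the question, this would be called as `df = read_csv_only(spark, csv_path)`. On Spark versions older than 3.0, where pathGlobFilter is not available, putting the glob in the path itself (for example `csv_path + "/*.csv"`) should give a similar result.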