apache-spark, pyspark, apache-spark-sql, databricks, parquet

read parquet files and skip rows that don't follow the schema


I have a bunch of parquet files stored in an S3 location that I want to load as a dataframe. However, there are a few rows that I don't want/need to be part of the dataframe. What is the best way to filter the parquet files so that these rows are not included in the dataframe? The parquet files were created from gzipped CSV files.

The schema that I want to use is:

from pyspark.sql.types import StructType, LongType, StringType

schema = StructType() \
      .add("col_1", LongType(), True) \
      .add("col_2", StringType(), True) \
      .add("col_3", StringType(), True)

The rows that I want to skip are rows in which col_1 is empty or holds a string value, and/or col_2 is empty. Hope this question makes sense...

PS: I should add that using the following syntax and trying to read the parquet files as a dataframe results in a failure:

df = spark.read.format("parquet").schema(schema).load('s3_bucket/folder_parquet_files/*')

The load itself doesn't fail directly, but when doing a display on the dataframe I get the following error: Error while reading file XXXXX. Parquet column cannot be converted. Column: [col_1], Expected: LongType, Found: BINARY Caused by: SchemaColumnConvertNotSupportedException
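
The error indicates that some of the parquet files physically store col_1 as BINARY (i.e. a string), so forcing LongType in the read schema fails at scan time instead of producing nulls. Below is a minimal sketch of one way around this, assuming the files consistently store col_1 as a string column (the name relaxed_schema is made up for illustration): read col_1 as a string, then cast it to long so non-numeric values become null and can be dropped by the filter in the solution.

    # Sketch, not the accepted answer: read col_1 as a string so the scan
    # doesn't fail on BINARY data, then cast it to long; values that aren't
    # valid numbers become null after the cast.
    from pyspark.sql.types import StructType, StringType
    from pyspark.sql.functions import col

    relaxed_schema = StructType() \
          .add("col_1", StringType(), True) \
          .add("col_2", StringType(), True) \
          .add("col_3", StringType(), True)

    df = spark.read.format("parquet").schema(relaxed_schema) \
          .load('s3_bucket/folder_parquet_files/*')

    # Non-numeric strings in col_1 turn into nulls here
    df = df.withColumn("col_1", col("col_1").cast("long"))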


Solution

  • You can apply a simple filter. You don't have to check whether col_1 holds a string value, because once it is read as LongType the only values it can contain are longs or nulls:

    from pyspark.sql.functions import col, isnull

    # keep rows where col_1 is not null and col_2 is neither null nor empty
    filtered_df = df.filter((~isnull("col_1")) & (~isnull("col_2")) & (col("col_2") != ""))
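
Note that a comparison like col("col_2") != "" evaluates to null when col_2 is null, and filter() treats null as false, so the explicit isnull check on col_2 is mostly belt-and-braces. A quick sanity check, assuming filtered_df was built as above:

    # Hypothetical spot check that the offending rows are gone
    filtered_df.printSchema()
    filtered_df.show(5, truncate=False)
    print("rows kept:", filtered_df.count())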