Tags: python, apache-spark, pyspark, apache-spark-sql, casting

How to identify columns of datatype "long" and cast them to "int" in PySpark?


I have a table with 372 columns, many of which have the "long" datatype. I want to cast those columns to the "int" datatype.

I found a solution in another, similar question asked here, but it isn't working for me.

from pyspark.sql.functions import col

schema = {col: col_type for col, col_type in df.dtypes}
time_cols = [col for col, col_type in schema.items() if col_type in "timestamp date".split() or "date" in col or "time" in col]

for column in time_cols:
    df = df.withColumn(column, col(column).cast("to_timestamp"))
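
That snippet filters for timestamp/date columns; for my actual case I assume the analogous filter would look for the "bigint" dtype (which is how Spark SQL reports "long"), e.g.:

long_cols = [c for c, t in df.dtypes if t == 'bigint']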

Solution

  • It's good practice to use a single .select instead of chaining many .withColumn calls whenever possible:

    # Cast every "bigint" (long) column to "int"; leave all other columns as-is
    df = df.select(
        [col(c).cast('int') if t == 'bigint' else c for c, t in df.dtypes]
    )
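
  • For reference, here is a minimal, self-contained sketch of the same approach on a toy DataFrame (the local SparkSession and the "id"/"name" columns are illustrative assumptions, not part of the original question). Python integer literals are inferred as "bigint" (long), so the select casts them down to "int" while leaving other columns untouched:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()

    # Toy DataFrame: "id" is inferred as bigint (long), "name" as string
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "name"])
    print(df.dtypes)   # [('id', 'bigint'), ('name', 'string')]

    # Cast every bigint column to int in a single select
    df = df.select(
        [col(c).cast('int') if t == 'bigint' else col(c) for c, t in df.dtypes]
    )
    print(df.dtypes)   # [('id', 'int'), ('name', 'string')]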