Tags: apache-spark, pyspark, ddl

from_csv schema option expects a DDL string, but there is no way to create one


The from_csv function (https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.functions.from_csv.html), used to parse a CSV string column in a DataFrame, has a schema option. However, this schema is expected to be a DDL string.

Is there a way to generate a DDL string from a StructType?

My problem is that, given a schema (StructType), I find no way in PySpark to generate a DDL string. The only workaround is to create a DataFrame and then access the Java functions through the internal `_jdf` handle, but this is not compatible with Spark Connect, which is how we develop.


Solution

  • In F.from_csv you can use the string returned by df.schema.simpleString():

    from pyspark.sql import functions as F  # assumes an active SparkSession `spark`
    
    data = [
        ('1,ABC', ),
        ('2,ABC', ),
        ('3,ABC', ),
    ]
    df = spark.createDataFrame(data, schema=['csv_line'])
    parsed_col = F.from_csv('csv_line', schema='struct<id:int,string_col:string>')
    
    df2 = df.select(parsed_col.alias('parsed_col'))
    df2.printSchema()
    # root
    #  |-- parsed_col: struct (nullable = true)
    #  |    |-- id: integer (nullable = true)
    #  |    |-- string_col: string (nullable = true)
    
    df2.select('parsed_col.*').show(10, False)
    
    # +---+----------+
    # |id |string_col|
    # +---+----------+
    # |1  |ABC       |
    # |2  |ABC       |
    # |3  |ABC       |
    # +---+----------+