The from_csv function (https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.functions.from_csv.html), used to parse a column of CSV strings in a DataFrame, has a schema option. However, this schema is expected to be a DDL string.
Is there a way to generate a DDL string from a StructType?
My problem is that if I have a schema as a StructType, I find no way in PySpark to generate a DDL string from it. The only way I have found is to create a DataFrame and then access the Java functions through _jdf, but that is not compatible with Spark Connect, which is how we develop.
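For reference, the workaround I mean looks roughly like this (spark here is an existing SparkSession; it goes through the JVM-backed DataFrame, which is exactly why it breaks under Spark Connect):

from pyspark.sql import types as T

schema = T.StructType([
    T.StructField('id', T.IntegerType()),
    T.StructField('string_col', T.StringType()),
])

# Build an empty DataFrame only to reach the underlying Java schema object
# and ask it for its DDL representation; _jdf is not available on Spark Connect.
ddl = spark.createDataFrame([], schema)._jdf.schema().toDDL()
print(ddl)  # e.g. 'id INT,string_col STRING'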
In F.from_csv you can use the string returned by df.schema.simpleString():
from pyspark.sql import functions as F
data = [
('1,ABC', ),
('2,ABC', ),
('3,ABC', ),
]
df = spark.createDataFrame(data, schema=['csv_line'])
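# the schema string below is exactly what simpleString() produces for this struct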
parsed_col = F.from_csv('csv_line', schema='struct<id:int,string_col:string>')
df2 = df.select(parsed_col.alias('parsed_col'))
df2.printSchema()
# root
# |-- parsed_col: struct (nullable = true)
# | |-- id: integer (nullable = true)
# | |-- string_col: string (nullable = true)
df2.select('parsed_col.*').show(10, False)
# +---+----------+
# |id |string_col|
# +---+----------+
# |1 |ABC |
# |2 |ABC |
# |3 |ABC |
# +---+----------+
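If your schema starts out as a StructType rather than a hand-written string, the same schema argument can be derived from it directly (a minimal sketch reusing the column names from the example above):

from pyspark.sql import functions as F
from pyspark.sql import types as T

struct_schema = T.StructType([
    T.StructField('id', T.IntegerType()),
    T.StructField('string_col', T.StringType()),
])

# simpleString() renders the StructType in the compact 'struct<...>' form,
# so no DataFrame and no _jdf access is needed.
schema_str = struct_schema.simpleString()
print(schema_str)  # struct<id:int,string_col:string>

parsed_col = F.from_csv('csv_line', schema=schema_str)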