
PySpark: write a dataframe to Parquet using a variable name as the output directory when that name is not part of the dataframe schema


Is there any way in PySpark to write a dataframe to Parquet, using a variable name as the output directory, when that name is not part of the dataframe schema?

Code

import os

tables_list = ['abc', 'def', 'xyz']
for table_name in tables_list:
    df.write.parquet(os.path.join("s3://bucket/output/"), table_name)

Error

table_name (abc, def, xyz) is not part of the schema.

Solution

  • Looks like a misplaced parenthesis rather than a real schema problem: os.path.join is closed too early, so table_name is passed to parquet() as a separate argument instead of being joined into the output path. Your code should work if you move the closing parenthesis to the end of the line:

        df.write.parquet(os.path.join("s3://bucket/output/", table_name))

    I didn't try it on S3, but the code below creates an "abc" directory under "/tmp" on HDFS.

        import os

        tables_list = ['abc']
        for table_name in tables_list:
            # the table name becomes the output directory, e.g. /tmp/abc
            df.write.parquet(os.path.join("/tmp", table_name))
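
    Applied back to the original S3 loop, here is a minimal sketch of the fixed version. It assumes each table's dataframe is loaded with spark.table(table_name); that lookup is an assumption for illustration, since the question doesn't show where df comes from.

        import os
        from pyspark.sql import SparkSession

        spark = SparkSession.builder.getOrCreate()

        tables_list = ['abc', 'def', 'xyz']
        for table_name in tables_list:
            # hypothetical source for df; the original question doesn't show it
            df = spark.table(table_name)
            # joining the variable into the path gives each table its own
            # directory, e.g. s3://bucket/output/abc
            df.write.parquet(os.path.join("s3://bucket/output/", table_name))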