
Synapse Notebook - Reading CSV file with restricted characters in column names into Parquet files


I have a CSV file with the column names in the first row. Unfortunately, some of the fields have square brackets and spaces in their names. Synapse fails to load with this error:

AnalysisException: Attribute name "xxxxx [xxxxxxxx]" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.

I have looked at spark-dataframe-column-naming-conventions-restrictions, which offers some advice on how to restate the column names, but my data is already in a DataFrame.

I would appreciate recommendations on how to approach this issue in PySpark.

Thanks


Solution

  • You can use the same code from the linked SO answer by pythonsherpa. It does not affect the data in the DataFrame; it only renames the columns.

    newColumns = []
    # Characters Parquet rejects in column names (space is handled separately below)
    problematic_chars = ',;{}()='
    for column in df.columns:
        column = column.lower()
        column = column.replace(' ', '_')   # replace spaces with underscores
        for c in problematic_chars:
            column = column.replace(c, '')  # drop the remaining restricted characters
        newColumns.append(column)
    df = df.toDF(*newColumns)  # rename columns without touching the data
    display(df.limit(10))
    

    I tested this code with a sample input CSV file in my environment, and it replaced the invalid characters.

    Sample Input:

    img:1 sample input DataFrame

    • `col 3` contains a space and `col;4` contains a semicolon (`;`); the code above replaces both.

    Result: img:2 DataFrame with transformed column names.
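If you also want to cover the other characters in the error message (`\n`, `\t`) and strip the square brackets in one pass, a small regex helper keeps the renaming logic in one place. This is a sketch, not the code from the linked answer; the DataFrame name `df` and the output path are illustrative:

```python
import re

# Characters Parquet rejects in column names, per the AnalysisException
# (" ,;{}()\n\t="), plus square brackets as an extra cleanup (assumption).
RESTRICTED = r'[ ,;{}()\n\t=\[\]]'

def sanitize(name: str) -> str:
    """Lower-case a column name, replace each restricted character with an
    underscore, collapse runs of underscores, and trim leading/trailing ones."""
    cleaned = re.sub(RESTRICTED, '_', name.lower())
    cleaned = re.sub(r'_+', '_', cleaned)
    return cleaned.strip('_')

# Applied to a Spark DataFrame (hypothetical `df`) before writing Parquet:
# df = df.toDF(*[sanitize(c) for c in df.columns])
# df.write.parquet("abfss://container@account.dfs.core.windows.net/output")
```

With the column names sanitized this way, the Parquet write no longer raises the `AnalysisException` quoted in the question.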