I have a csv file with the column names in the first row. Unfortunately some of the fields have square brackets and spaces in them. Synapse is failing to load with this error:
AnalysisException: Attribute name "xxxxx [xxxxxxxx]" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.
I have looked at spark-dataframe-column-naming-conventions-restrictions which provides some advice on how to restate the column names, but the data is already in a dataframe.
I would appreciate some recommendations on how to approach this issue in pyspark
Thanks
You can use the same code in the mentioned SO link by pythonsherpa. It will not affect the data in the data frame. This will only replace the column names.
newColumns = []
problematic_chars = ',;{}()='
for column in df.columns:
column = column.lower()
column = column.replace(' ', '_')
for c in problematic_chars:
column = column.replace(c, '')
newColumns.append(column)
df = df.toDF(*newColumns)
display(df.limit(10))
I tried to replace the invalid characters with this code and sample input csv file in my environment.
Sample Input:
img:1 sample input dataframe
col 3
has space and col;4
has semicolon ;. The above code will replace them.Result
img:2 Dataframe with Transformed column names.