How to replace string in column names of pyspark dataframe?


I have a PySpark DataFrame in which every column name is prefixed with the table name, i.e. Table.col1, Table.col2, ...

I would like to replace 'Table.' with '' (nothing) in every column in my dataframe.

How do I do this? Everything I have found deals with doing this to the values in the columns and not the column names themselves.


Solution

  • One option is to use toDF together with Python's str.replace on each column name:

    DataFrame.toDF(*cols)
    Returns a new DataFrame with the new specified column names

    out = df.toDF(*[c.replace("Table.", "") for c in df.columns])
    

    Output :

    out.show()
    +----+----+
    |col1|col2|
    +----+----+
    | foo|   1|
    | bar|   2|
    +----+----+
    

    Input used :

    +----------+----------+
    |Table.col1|Table.col2|
    +----------+----------+
    |       foo|         1|
    |       bar|         2|
    +----------+----------+
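
Note that str.replace removes "Table." wherever it occurs in a name, not only at the start. If that matters, the renaming step can be anchored to the prefix. The sketch below is one way to do that in plain Python; the helper name strip_prefix and the sample name "my.Table.id" are made up for illustration and are not part of PySpark.

```python
import re

def strip_prefix(columns, prefix="Table."):
    """Return the column names with a leading `prefix` removed.

    Hypothetical helper: names that do not start with `prefix`
    are returned unchanged.
    """
    # re.escape makes the "." in the prefix match a literal dot.
    pattern = re.compile("^" + re.escape(prefix))
    return [pattern.sub("", c) for c in columns]

new_names = strip_prefix(["Table.col1", "Table.col2", "my.Table.id"])
print(new_names)  # ['col1', 'col2', 'my.Table.id']

# With an existing DataFrame `df` (assumed), the result would be applied as:
# out = df.toDF(*strip_prefix(df.columns))
```

Because toDF assigns the new names by position, the dotted original names never have to be referenced by name, which would otherwise require backtick quoting in calls such as select.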