Search code examples
pysparkapache-spark-sqlmultiple-columnsrename

Pyspark automatically rename repeated columns


I want to automatically rename repeated columns of a df. For example:

df 
Out[4]: DataFrame[norep1: string, num1: string, num1: bigint, norep2: bigint, num1: bigint, norep3: bigint]

Apply some function to end with a df like:

f_rename_repcol(df) 
Out[4]: DataFrame[norep1: string, num1_1: string, num1_2: bigint, norep2: bigint, num1_3: bigint, norep3: bigint]

I've already create my own function, and works, but I'm sure there is a shorter and better way of doing it:

def f_df_col_renombra_rep(df):
    from collections import Counter
    from itertools import chain
    import pandas as pd

    columnas_original = np.array(df.columns)
    d1 = Counter(df.columns)
    i_corrige = [a>1 for a in dict(d1.items()).values()]
    var_corrige = np.array(dict(d1.items()).keys())[i_corrige]
    var_corrige_2 = [a for a in columnas_original if a in var_corrige]
    columnas_nuevas = []
    for var in var_corrige:
        aux_corr = [a for a in var_corrige_2 if a in var]
        i=0
        columnas_nuevas_aux=[]
        for valor in aux_corr:
            i+=1
            nombre_nuevo = valor +"_"+ str(i)
            columnas_nuevas_aux.append(nombre_nuevo)
        columnas_nuevas.append(columnas_nuevas_aux)
    columnas_nuevas=list(chain.from_iterable(columnas_nuevas))
    indice_cambio = pd.Series(columnas_original).isin(var_corrige)
    i = 0
    j = 0
    colsalida = [None]*len(df.columns)
    for col in df.columns:
        if indice_cambio[i] == True:
            colsalida[i] = columnas_nuevas[j]
            j += 1
        else:
            colsalida[i] = col
            # no cambio el nombre
        i += 1

    df_out = df.toDF(*(colsalida))

    return  df_out

Solution

  • You can modify the renaming function here to suit your need, but broadly I find this as the best way to rename all the duplicates columns

    old_col=df.schema.names
    running_list=[]
    new_col=[]
    i=0
    for column in old_col:
        if(column in running_list):
            new_col.append(column+"_"+str(i))
            i=i+1
        else:
            new_col.append(column)
            running_list.append(column)
    print(new_col)
    

    This the conversion I do, the suffix assigned to the duplicate columns is not that of difference until the name(prefix) remains the same & I can save the file.

    To update the columns you can simply run:

    df=df.toDF(*new_col)
    

    This should update the column names and remove all the duplicates

    If you want to keep the numbering as _1,_2,_3: You can use a dictionary and try and except block,

    dict={}
    for column in old_col:
        try:
            i=dict[column]+1
            new_col.append(column+"_"+str(i))
            dict[column]=i
        except:
            dict[column]=1
            new_col.append(column+"_"+str(1)
    print(new_col)