I have a PySpark DataFrame with 100 columns (only 5 shown below for explanation):
I need to re-arrange the positions of around 30 specific columns only and leave the rest as they are. The sequence (i.e. the position) in which these specific columns need to appear is listed in an Excel table (only 3 shown below for explanation):
Now, I could have used df = df.select('C', 'E', 'B', ...and so on)
But it becomes too tedious to write out all 100 column names in the correct sequence.
So is there an efficient way to do this by simply reading the sequence from the table as a list and using it as a reference to re-arrange the columns?
To continue with the example, if there were just these 5 columns, of which 3 had to be re-arranged, then the output would look like this (columns ordered C, E, B, A, D):
Note: I'm using Python 2.7.5 & Spark 2.4.0
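To make the example concrete, here is a minimal sketch of the setup, assuming five columns A through E with dummy string values (the real table isn't reproduced here):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# stand-in for the real 100-column DataFrame: 5 string columns A..E
df = spark.createDataFrame(
    [("a1", "b1", "c1", "d1", "e1")],
    ["A", "B", "C", "D", "E"],
)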
You can access all the columns with the property df.columns
(which is a Python list) and simply manipulate it with a list comprehension:
# assuming you can build the list sequence from the Excel table
sequence = ["C", "E", "B"]
# sequenced columns first, then every remaining column in its original order
df = df.select(sequence + [col for col in df.columns if col not in sequence])
df
> DataFrame[C: string, E: string, B: string, A: string, D: string]
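And to avoid typing the list at all, a sketch of pulling it from the Excel table with pandas; the file name column_order.xlsx and the header Column are assumptions, adjust them to the real sheet:

import pandas as pd

# read the one-column Excel table that lists the desired order
# (file name and column header here are hypothetical)
sequence = pd.read_excel("column_order.xlsx")["Column"].tolist()

df = df.select(sequence + [col for col in df.columns if col not in sequence])

Note that df.select accepts a plain Python list of column names, so no unpacking is needed.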