Dummy Encoding using Pyspark

I would like to dummy encode my categorical variables as numerical (0/1) variables, as shown in the image, using PySpark syntax.

I read in the data like this:

data = spark.read.csv("data.txt", sep = ";", header = "true")

In Python (pandas) I am able to encode my variables using the code below:

data = pd.get_dummies(data, columns = ['Continent'])

However, I am not sure how to do it in PySpark.

Any assistance would be greatly appreciated.


  • Try this:

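    import pyspark.sql.functions as F
    # df is the DataFrame to encode (called data in the question)
    # Collect the distinct category values to the driver
    categ = df.select('Continent').distinct().rdd.flatMap(lambda x: x).collect()
    # Build one 0/1 indicator column per category, named after the category
    exprs = [F.when(F.col('Continent') == cat, 1).otherwise(0)
                .alias(str(cat)) for cat in categ]
    # Select the indicator columns together with the original columns
    df = df.select(exprs + df.columns)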

    Exclude df.columns if you do not want the original columns in your transformed dataframe.