Tags: python, dataframe, apache-spark, pyspark, fill

Fill a missing column in a PySpark dataframe with zeros


I have a script that currently runs on a sample dataframe. Here's my code:

fd_orion_apps = fd_orion_apps.groupBy('msisdn', 'apps_id').pivot('apps_id').count().select('msisdn', *parameter_cut.columns).fillna(0)

After pivoting, some of the columns in parameter_cut may not exist in fd_orion_apps, which produces an error like this:

Py4JJavaError: An error occurred while calling o1381.select.
: org.apache.spark.sql.AnalysisException: cannot resolve '`8602`' given input columns: [7537, 7011, 2658, 3582, 12120, 31049, 35010, 16615, 10003, 15067, 1914, 1436, 6032, 422, 10636, 10388, 877,...
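The root cause is easy to check without Spark: the pivot only creates a column for each apps_id value actually present in the data, so any expected column absent from the pivoted output makes the select fail. A minimal sketch with hypothetical column lists standing in for the two dataframes' .columns attributes:

```python
# Hypothetical stand-ins for fd_orion_apps.columns (after the pivot)
# and parameter_cut.columns from the question.
pivoted_cols = ['msisdn', '7537', '7011', '2658', '3582']
expected_cols = ['7537', '8602', '2658']

# Any expected column not produced by the pivot triggers the
# "cannot resolve" AnalysisException when selected.
missing = sorted(set(expected_cols) - set(pivoted_cols))
print(missing)  # ['8602']
```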


Solution

  • You can separate the select into its own step. That lets you combine a conditional expression with a list comprehension:

    from pyspark.sql import functions as F
    
    fd_orion_apps = fd_orion_apps.groupBy('msisdn', 'apps_id').pivot('apps_id').count()
    fd_orion_apps = fd_orion_apps.select(
        'msisdn',
        *[c if c in fd_orion_apps.columns else F.lit(0).alias(c) for c in parameter_cut.columns]
    )