python · pyspark · comments

Put comments in between a multi-line statement (with line continuation)


When I write the following PySpark command:

# comment 1
df = df.withColumn('explosion', explode(col('col1'))).filter(col('explosion')['sub_col1'] == 'some_string') \
    # comment 2
    .withColumn('sub_col2', from_unixtime(col('explosion')['sub_col2'])) \
    # comment 3
    .withColumn('sub_col3', from_unixtime(col('explosion')['sub_col3']))

I get the following error:

.withColumn('sub_col2', from_unixtime(col('explosion')['sub_col2']))
^
IndentationError: unexpected indent

Is there a way to write comments between the lines of a multi-line statement in PySpark?


Solution

  • This is not a PySpark issue, but rather a violation of Python syntax.

    Consider the following example:

    a, b, c = range(3)
    a +\
    # add b
    b +\
    # add c
    c
    

    This results in:

        a +# add b
                  ^
    SyntaxError: invalid syntax
    

    The \ is a line-continuation character: Python treats whatever comes on the next line as if it followed immediately, so the comment comments out the rest of the joined logical line. In your snippet the statement therefore ends at # comment 2, and the indented .withColumn(...) line that follows is parsed as a new statement, which is why you get an IndentationError rather than a SyntaxError.
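    Note that the backslash must also be the very last character on its line, so you cannot dodge the problem by putting the comment after the \ either. A minimal sketch:

    a = 1 + \ # add one


    This fails with a SyntaxError along the lines of:

    SyntaxError: unexpected character after line continuation character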

    One way around this is to use parentheses instead:

    (a +
    # add b
    b +
    # add c
    c)
    

    When assigning to a variable, this would look like:

    # do a sum of 3 numbers
    addition = (a +
                # add b
                b +
                # add c
                c)
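
    This works because Python joins lines implicitly inside parentheses, brackets, and braces, where comments and blank lines are allowed. The same trick works for any bracketed construct, for instance a list built from the same a, b, c:

    numbers = [a,
               # b goes in the middle
               b,
               c]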
    

    Or in your case:

    # comment 1
    df = (df.withColumn('explosion', explode(col('col1')))
        .filter(col('explosion')['sub_col1'] == 'some_string')
        # comment 2
        .withColumn('sub_col2', from_unixtime(col('explosion')['sub_col2']))
        # comment 3
        .withColumn('sub_col3', from_unixtime(col('explosion')['sub_col3'])))
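
    As an aside, if you are copying the snippet into a fresh session, the functions used above all come from pyspark.sql.functions:

    from pyspark.sql.functions import col, explode, from_unixtime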