
Pyspark: Adding row/column with single value of row counts


I have a PySpark dataframe that I'd like to get the row count for. Once I have the row count, I'd like to add it to the top-left corner of the dataframe, as shown below.

I've tried creating the row first and doing a union of the empty row and the dataframe, but the empty row gets overwritten. I've also tried adding the count as a literal in a new column, but I'm having trouble nulling out the remainder of that column as well as the rest of the count's row. Any advice?

dataframe:

col1 col2 col3 ... col13
string string timest ... int

for a few rows.

desired output:

row_count col1   col2   col3   ... col13
numofrows null   null   null   ... null
null      string string timest ... int

So the row count would sit where an otherwise empty row and empty column meet.


Solution

  • Assuming df is your dataframe:

    from pyspark.sql import functions as F

    cnt = df.count()
    columns_list = df.columns

    # Add an all-null row_count column, then move it to the front
    # so it matches the desired output
    df = df.withColumn("row_count", F.lit(None).cast("long"))
    df = df.select("row_count", *columns_list)
    schema = df.schema

    # Build a one-row dataframe: the count, padded with a null
    # for every original column
    cnt_line = spark.createDataFrame(
        [[cnt] + [None for x in columns_list]], schema=schema
    )

    # Put the count row on top; union is the non-deprecated
    # spelling of unionAll
    cnt_line.union(df).show()
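The key step is padding the single count value with one null per original column so the row lines up with the dataframe's schema. That padding can be sanity-checked in plain Python without Spark; the column names and count below are placeholders, and the count goes at whichever end of the row the `row_count` column occupies in the schema:

```python
columns_list = ["col1", "col2", "col3"]  # placeholder for df.columns
cnt = 42                                 # placeholder for df.count()

# Pad the count with one null per original column so the row
# matches the schema's width (len(columns_list) + 1 fields).
row_count_first = [cnt] + [None] * len(columns_list)   # row_count is the first column
row_count_last = [None] * len(columns_list) + [cnt]    # row_count was appended last

print(row_count_first)  # [42, None, None, None]
print(row_count_last)   # [None, None, None, 42]
```

Either list can then be wrapped in an outer list and passed to `spark.createDataFrame(...)` with the matching schema.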