pyspark azure-synapse

How to display the size of each record of a PySpark Dataframe?


We read a parquet file into a PySpark dataframe and load it into Synapse. However, the dataframe apparently contains records that exceed the 1 MB row-size limit in Synapse (PolyBase), and our Databricks ingestion scripts keep throwing the error below:

The size of the schema/row at ordinal 'n' exceeds the maximum allowed row size of 1000000 bytes.

I'm trying to find out which row in my dataframe has this issue, but I'm unable to identify the faulty row.

I was able to print the length of each column of the dataframe, but how do I print the size of each record?

Is there a way to do this? Can someone please help?


Solution

  • Use the code below to get the size of each row:

    import sys

    # Collect the rows to the driver and print an approximate size for each one
    rows = df.collect()
    for rw in rows:
        # Concatenate the row's values as strings and measure the result;
        # str() makes this work even when some columns are not string-typed
        print(str(sys.getsizeof(''.join(str(c) for c in rw))) + " bytes")

    This gives you the size of each row in bytes.


    From this output, check which records exceed the 1,000,000-byte limit.
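
    If the dataframe is too large to collect() onto the driver, a distributed variant can approximate each row's size on the cluster instead. This is a minimal sketch, assuming the dataframe is bound to df and using the length of each row's JSON representation as a rough proxy for its byte size; the column name approx_row_size_bytes is illustrative:

    from pyspark.sql import functions as F

    # Approximate each row's size as the length of its JSON representation.
    # This runs on the executors, so it also works for dataframes that are
    # too large to collect() onto the driver.
    df_with_size = df.withColumn(
        "approx_row_size_bytes",
        F.length(F.to_json(F.struct(*df.columns)))
    )

    # Rows that are likely to exceed the 1,000,000-byte PolyBase limit
    df_with_size.filter(F.col("approx_row_size_bytes") > 1000000).show(truncate=False)

    Note that this measures characters in the JSON string rather than the exact serialized size, so treat it as an estimate for narrowing down the offending records.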