Tags: apache-spark, pyspark, apache-spark-sql, azure-databricks

pyspark max string length for each column in the dataframe


I am trying this in Databricks. Please let me know which PySpark libraries need to be imported, and the code needed to produce the output below, in Azure Databricks PySpark.

Example input dataframe:

| column1         | column2        | column3  | column4   |
|-----------------|----------------|----------|-----------|
| a               | bbbbb          | cc       | >dddddddd |
| >aaaaaaaaaaaaaa | bb             | c        | dddd      |
| aa              | >bbbbbbbbbbbb  | >ccccccc | ddddd     |
| aaaaa           | bbbb           | ccc      | d         |

Output dataframe:

| column  | maxLength |
|---------|-----------|
| column1 |        14 |
| column2 |        12 |
| column3 |         7 |
| column4 |         8 |
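
For reference, this sample input can be recreated with a sketch like the following (assuming the `spark` session that Databricks provides, and treating the leading `>` in the table as a marker for the longest value per column rather than part of the data):

    # Recreate the sample input: four string columns of varying lengths.
    df = spark.createDataFrame(
        [
            ("a", "bbbbb", "cc", "dddddddd"),
            ("aaaaaaaaaaaaaa", "bb", "c", "dddd"),
            ("aa", "bbbbbbbbbbbb", "ccccccc", "ddddd"),
            ("aaaaa", "bbbb", "ccc", "d"),
        ],
        ["column1", "column2", "column3", "column4"],
    )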

Solution

  • >>> from pyspark.sql import functions as sf
    >>> df = sc.parallelize([['a','bbbbb','ccc','ddd'],['aaaa','bbb','ccccccc', 'dddd']]).toDF(["column1", "column2", "column3", "column4"])
    >>> df1 = df.select([sf.length(col).alias(col) for col in df.columns])
    >>> df1.groupby().max().show()
    +------------+------------+------------+------------+
    |max(column1)|max(column2)|max(column3)|max(column4)|
    +------------+------------+------------+------------+
    |           4|           5|           7|           4|
    +------------+------------+------------+------------+
    

    then melt (unpivot) the previous one-row result into (column, maxLength) rows; a sketch of that step is below
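
    One way to do that melt, sketched with a stack() expression (df1 is the lengths DataFrame built above; on Spark 3.4+, DataFrame.unpivot can do this directly):

    # Turn the one-row result of df1.groupby().max() into (column, maxLength) rows.
    maxes = df1.groupby().max()          # columns are named max(column1), max(column2), ...

    # stack(n, 'label1', value1, 'label2', value2, ...) emits one row per (label, value) pair;
    # the max(...) wrapper is stripped from each column name to get the plain label.
    pairs = ", ".join("'{0}', `{1}`".format(c[4:-1], c) for c in maxes.columns)
    melted = maxes.selectExpr("stack({}, {}) as (column, maxLength)".format(len(maxes.columns), pairs))
    melted.show()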

    Edit (from "Iterate through each column and find the max length"):

    Single line select

    from pyspark.sql.functions import col, length, max

    # One row in which each column holds that column's maximum string length.
    df = df.select([max(length(col(name))).alias(name) for name in df.schema.names])
    

    Output:
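
    With the question's sample data, this select should return a single row of per-column maxima, along the lines of:

    +-------+-------+-------+-------+
    |column1|column2|column3|column4|
    +-------+-------+-------+-------+
    |     14|     12|      7|      8|
    +-------+-------+-------+-------+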

    As Rows

    from pyspark.sql import Row

    df = df.select([max(length(col(name))).alias(name) for name in df.schema.names])
    row = df.first().asDict()
    df2 = spark.createDataFrame([Row(col=name, length=row[name]) for name in df.schema.names], ['col', 'length'])
    

    Output:

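    With the question's sample data, df2.show() should come out roughly as:

    +-------+------+
    |    col|length|
    +-------+------+
    |column1|    14|
    |column2|    12|
    |column3|     7|
    |column4|     8|
    +-------+------+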