java dataframe apache-spark aggregate

Selecting each column and its maximum length from a Spark data frame in Java


I am new to Spark.

I want to select each column name and its maximum length from a Spark data frame in Java.

I am using Spark. I found a few examples, but they are in Scala and Python, and they did not work for me. The input and expected output are as below.

I would like to find the length of the longest element in each column.

I tried:

df.select(Arrays.stream(df.columns().map(colname -> df.agg(max(length(col(colname))).head().get(0));

But I am not sure how to get a data frame with the column names and their maximum lengths.

Regards, Pramod


Solution

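  • You were close. Instead of running a separate df.agg per column, build one max(length(col(...))) expression per column and pass them all to a single select: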

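        // Imports (place at the top of the file)
        import java.util.Arrays;
        import java.util.List;
        import org.apache.spark.SparkConf;
        import org.apache.spark.sql.Column;
        import org.apache.spark.sql.Dataset;
        import org.apache.spark.sql.Row;
        import org.apache.spark.sql.RowFactory;
        import org.apache.spark.sql.SparkSession;
        import org.apache.spark.sql.types.DataTypes;
        import org.apache.spark.sql.types.StructField;
        import org.apache.spark.sql.types.StructType;
        import static org.apache.spark.sql.functions.col;
        import static org.apache.spark.sql.functions.length;
        import static org.apache.spark.sql.functions.max;
        import static org.apache.spark.sql.types.DataTypes.createStructField;
    
        // Local session for testing
        SparkSession spark =  SparkSession.builder()
                .config(new SparkConf()
                .setAppName("test")
                .setMaster("local[*]"))
                .getOrCreate();
    
        // Sample data: three nullable string columns
        StructType schema = DataTypes.createStructType(new StructField[]{
                createStructField("col1", DataTypes.StringType, true),
                createStructField("col2", DataTypes.StringType, true),
                createStructField("col3", DataTypes.StringType, true)
        });
    
        List<Row> rows = Arrays.asList(
                RowFactory.create("aaa", "b","cc"),
                RowFactory.create("a", "bbbbbbb", "c"),
                RowFactory.create("aa", "bbb", "ccccc")
        );
    
        Dataset<Row> df = spark.createDataFrame(rows, schema);
        df.show();
    
        // One aggregate per column, aliased back to the column name; yields a single row
        Dataset<Row> resultSingleRow = df.select(Arrays.stream(df.columns())
                .map(colname -> max(length(col(colname))).as(colname))
                .toArray(Column[]::new));
        resultSingleRow.show();
    
        // spark 3.4+: turn the single row into one (column-name, length) row per column
        Dataset<Row> resultMultipleRows = resultSingleRow.unpivot(new Column[]{}, "column-name", "length");
        resultMultipleRows.show();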
    

    Output:

    +----+-------+-----+
    |col1|   col2| col3|
    +----+-------+-----+
    | aaa|      b|   cc|
    |   a|bbbbbbb|    c|
    |  aa|    bbb|ccccc|
    +----+-------+-----+
    
    
    +----+----+----+
    |col1|col2|col3|
    +----+----+----+
    |   3|   7|   5|
    +----+----+----+
    
    
    +-----------+------+
    |column-name|length|
    +-----------+------+
    |       col1|     3|
    |       col2|     7|
    |       col3|     5|
    +-----------+------+
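
The unpivot call above needs Spark 3.4+. On older versions, one possible workaround is the Spark SQL stack generator function passed to selectExpr. The sketch below only builds the expression string; buildStackExpr and StackExprSketch are hypothetical names, not part of Spark, and the sketch assumes the column names contain no quotes or backticks.

```java
import java.util.StringJoiner;

class StackExprSketch {

    // Hypothetical helper (not a Spark API): builds a Spark SQL expression such as
    //   stack(2, 'col1', `col1`, 'col2', `col2`) as (`column-name`, `length`)
    // Assumption: column names contain no single quotes or backticks.
    public static String buildStackExpr(String[] columns, String nameCol, String valueCol) {
        if (columns.length == 0) {
            throw new IllegalArgumentException("at least one column is required");
        }
        StringJoiner args = new StringJoiner(", ");
        for (String c : columns) {
            args.add("'" + c + "'");   // string literal: the column name as a value
            args.add("`" + c + "`");   // quoted identifier: the column's value
        }
        // stack(n, ...) emits n rows; here each row is (name, value)
        return "stack(" + columns.length + ", " + args + ") as (`"
                + nameCol + "`, `" + valueCol + "`)";
    }

    public static void main(String[] args) {
        String expr = buildStackExpr(
                new String[]{"col1", "col2", "col3"}, "column-name", "length");
        System.out.println(expr);
        // With the data frames from the answer, this would be used roughly as:
        //   Dataset<Row> resultMultipleRows = resultSingleRow.selectExpr(expr);
    }
}
```

All values passed to stack must share one type, which holds here because every max(length(...)) column is an integer.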