I am new to Spark.
I want to select each column name of a Spark DataFrame along with the maximum length of the values in that column, using Java.
I found a few examples, but they are in Scala or Python and do not work for me in Java. The input and expected output are shown below.
I would like to find the length of the longest element in each column.
I tried
df.select(Arrays.stream(df.columns()).map(colname -> df.agg(max(length(col(colname)))).head().get(0)));
but I am not sure how to get a data frame with the column names and their maximum lengths.
Regards, Pramod
You were close. Build one aggregate column per input column with a stream, collect them into a Column[], and pass that array to select:
// assumes static imports of org.apache.spark.sql.functions.{max, length, col}
// and org.apache.spark.sql.types.DataTypes.createStructField
SparkSession spark = SparkSession.builder()
        .config(new SparkConf()
                .setAppName("test")
                .setMaster("local[*]"))
        .getOrCreate();

StructType schema = DataTypes.createStructType(new StructField[]{
        createStructField("col1", DataTypes.StringType, true),
        createStructField("col2", DataTypes.StringType, true),
        createStructField("col3", DataTypes.StringType, true)
});

List<Row> rows = Arrays.asList(
        RowFactory.create("aaa", "b", "cc"),
        RowFactory.create("a", "bbbbbbb", "c"),
        RowFactory.create("aa", "bbb", "ccccc")
);

Dataset<Row> df = spark.createDataFrame(rows, schema);
df.show();

// one aggregate column per input column: max(length(col)), aliased to the column name
Dataset<Row> resultSingleRow = df.select(
        Arrays.stream(df.columns())
                .map(colname -> max(length(col(colname))).as(colname))
                .toArray(Column[]::new));
resultSingleRow.show();

// Spark 3.4+: unpivot the single row into one (column-name, length) row per column
Dataset<Row> resultMultipleRows =
        resultSingleRow.unpivot(new Column[]{}, "column-name", "length");
resultMultipleRows.show();
Output:
+----+-------+-----+
|col1| col2| col3|
+----+-------+-----+
| aaa| b| cc|
| a|bbbbbbb| c|
| aa| bbb|ccccc|
+----+-------+-----+
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 3| 7| 5|
+----+----+----+
+-----------+------+
|column-name|length|
+-----------+------+
| col1| 3|
| col2| 7|
| col3| 5|
+-----------+------+
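If you are on a Spark version before 3.4, Dataset.unpivot is not available, but you can get the same long format with the SQL stack generator via selectExpr. Below is a minimal sketch of a helper that builds the stack expression from the column names; the class name StackExpr and the output column names are my own choices, not from the original answer.

```java
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class StackExpr {
    // Builds "stack(n, 'c1', `c1`, 'c2', `c2`, ...) as (`column-name`, `length`)"
    // which turns a single-row result into one (column-name, length) row per column.
    static String stackExpr(String[] cols) {
        String pairs = Stream.of(cols)
                // each pair: the column name as a string literal, then the column itself
                .map(c -> "'" + c + "', `" + c + "`")
                .collect(Collectors.joining(", "));
        return "stack(" + cols.length + ", " + pairs + ") as (`column-name`, `length`)";
    }

    public static void main(String[] args) {
        System.out.println(stackExpr(new String[]{"col1", "col2", "col3"}));
        // prints: stack(3, 'col1', `col1`, 'col2', `col2`, 'col3', `col3`) as (`column-name`, `length`)
    }
}
```

You would then apply it to the one-row aggregate result, e.g. resultSingleRow.selectExpr(StackExpr.stackExpr(df.columns())), which should produce the same three-row output as unpivot above.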