Tags: dataframe, scala, apache-spark

How to transform one column argument and one values argument into a DataFrame in Scala?


I'm having difficulty creating a DataFrame from one arg containing column names and another arg containing the values, in Scala.

It doesn't recognize spark.createDataFrame(dataClient).toDF(columns. _*)


val dataClient = args(0).split(",").map(_.trim)
val column = args(1).split(",").map(_.trim)

val dataClient = Seq(dataClient)
val df: DataFrame = spark.createDataFrame(dataClient).toDF(columns. _*)


Can I do it another way without needing a lib?

Bear in mind that I can't use the cloud, so the libraries I can import are very limited.

In short: a way to transform a column argument and a values argument into a DataFrame.


Solution

  • Per Gaël J's comment: did you mean to use the "." in columns. _*? The fact you copy-pasted it suggests so. The varargs syntax is columns: _*, not columns. _*.
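
    For reference, here's a minimal plain-Scala sketch of what : _* does: it expands a collection into a method's varargs parameter (no Spark needed; greet is a made-up function purely for illustration):

    // greet takes repeated String parameters (a varargs method):
    def greet(names: String*): String = names.mkString("hello ", ", ", "!")

    val names = Array("alice", "bob")
    greet(names: _*)  // ": _*" expands the array into the varargs: "hello alice, bob!"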

    Also:

    val dataClient = args(0).split(",").map(_.trim)
    val column = args(1).split(",").map(_.trim)
    
    val dataClient = Seq(dataClient)
    val df: DataFrame = spark.createDataFrame(dataClient).toDF(columns. _*)
    

    line 1 declares dataClient as an Array[String]; line 3 then tries to redeclare dataClient as a Seq[Array[String]] with a single entry. You also declare column but then refer to it as columns.

    Finally, createDataFrame already creates a DataFrame, so the toDF call doesn't make sense here.

    It's not clear what you are trying to do with this code, even after those corrections, but I'd assume it's this:

    import org.apache.spark.sql.DataFrame
    import spark.implicits._

    // Simulating the program arguments:
    val args = Seq("a,b,c,d", "colname")

    val dataClient = args(0).split(",").map(_.trim).toSeq
    val column = args(1).trim

    val df: DataFrame = dataClient.toDF(column)
    df.show()
    

    yielding:

    +-------+
    |colname|
    +-------+
    |      a|
    |      b|
    |      c|
    |      d|
    +-------+
    

    The "import spark.implicits._" brings into scope all the implicit machinery needed to call DF and ".toSeq" allows that machinery to see a Seq[String] rather than an Array[String] to toDF can work.

    If you are trying to get multiple columns through, it's not so straightforward, as each "row" would be treated as an array if you use something like this:

    import org.apache.spark.sql.DataFrame
    import spark.implicits._

    val args = Seq("a,b, c, d | e,f ,g, h ", "col1, col2, col3 , col4")

    // Split rows on "|", then each row's fields on ",":
    val dataClient = args(0).split("\\|").map(_.trim).map(_.split(",").map(_.trim)).toSeq
    val columns = args(1).split(",").map(_.trim)

    val df: DataFrame = dataClient.toDF(columns: _*)
    df.show()
    

    which yields:

    java.lang.IllegalArgumentException: requirement failed: The number of columns doesn't match.
    Old column names (1): value
    New column names (4): col1, col2, col3, col4
    
        at scala.Predef$.require(Predef.scala:281)
    

    To make this work you would need to construct the rows directly:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.catalyst.encoders.RowEncoder
    import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    val args = Seq("a,b, c, d | e,f ,g, h ", "col1, col2, col3 , col4")

    val columns = args(1).split(",").map(_.trim)
    val schema = StructType(columns.map(c => StructField(c, StringType)))

    // Build one Row per "|"-separated group, attaching the schema directly:
    val dataClient = args(0).split("\\|").map(_.trim).map { raw =>
      val rawSeq = raw.split(",").map(_.trim)
      new GenericRowWithSchema(rawSeq.asInstanceOf[Array[Any]], schema): Row
    }.toSeq

    // Needed so createDataset knows how to encode Rows with this schema:
    implicit val rowEnc = RowEncoder(schema)

    val df = spark.createDataset(dataClient)
    df.show()
    

    resulting in:

    +----+----+----+----+
    |col1|col2|col3|col4|
    +----+----+----+----+
    |   a|   b|   c|   d|
    |   e|   f|   g|   h|
    +----+----+----+----+
    

    but this has a few things in it you should avoid:

    new GenericRowWithSchema(rawSeq.asInstanceOf[ Array[Any] ], schema) : Row
    

    whilst GenericRowWithSchema is not "internal" to Spark, it's definitely not intended for you to use Spark like this.
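
    For completeness, a sketch of the same result using only public API: build the rows with the Row(...) factory and hand them, together with the schema, to spark.createDataFrame, which accepts an RDD[Row] plus a StructType; no casts or explicit encoders needed:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    val columns = args(1).split(",").map(_.trim)
    val schema = StructType(columns.map(c => StructField(c, StringType)))

    // Row(...) is the public factory; each field becomes one column value:
    val rows = args(0).split("\\|").map(_.trim).map { raw =>
      Row(raw.split(",").map(_.trim): _*)
    }.toSeq

    val df = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)
    df.show()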

    If this is a learning experience, I'd suggest making a CSV file and reading that instead of trying to construct a DataFrame like this. If not, then I'd recommend a file exchange format like Parquet, which includes the schema and doesn't come with parsing concerns like actually wanting a "," in your field (or |, or whatever else you try to use as a separator). If it really must be text, then look at JSON, which would also allow some structure to be passed in (although not maps with non-string keys).
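
    As a sketch of that last suggestion (clients.csv and clients.parquet are hypothetical paths, just for illustration):

    // CSV: the header row supplies the column names, no manual splitting needed:
    val fromCsv = spark.read.option("header", "true").csv("clients.csv")

    // Parquet: the schema travels with the file, so reading it back needs no parsing at all:
    fromCsv.write.mode("overwrite").parquet("clients.parquet")
    val fromParquet = spark.read.parquet("clients.parquet")
    fromParquet.show()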