Tags: dataframe, scala, apache-spark

How to transform one column argument and one values argument into a DataFrame in Scala?


I'm having difficulty creating a DataFrame from one arg containing column names and another arg containing the values, in Scala.

It doesn't recognize spark.createDataFrame(dataClient).toDF(columns. _*)


val dataClient = args(0).split(",").map(_.trim)
val column = args(1).split(",").map(_.trim)

val dataClient = Seq(dataClient)
val df: DataFrame = spark.createDataFrame(dataClient).toDF(columns. _*)


Can I do it another way without needing a lib?

Bear in mind that I can't use the cloud, so the libraries I can import are very limited.

In short: a way to transform a column argument and a values argument into a DataFrame.


Solution

  • Per Gaël J's comment: did you mean to use the "." in columns. _*? The fact you copy-pasted it suggests so. The varargs syntax is columns: _*, not columns. _*.
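
    For reference, here's a minimal plain-Scala sketch of what : _* does: it expands a collection into a method's varargs parameter (no Spark needed; greet is a made-up function purely for illustration):

    // greet takes repeated String parameters (a varargs method):
    def greet(names: String*): String = names.mkString("hello ", ", ", "!")

    val names = Array("alice", "bob")
    greet(names: _*)  // ": _*" expands the array into the varargs: "hello alice, bob!"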

    Also:

    val dataClient = args(0).split(",").map(_.trim)
    val column = args(1).split(",").map(_.trim)
    
    val dataClient = Seq(dataClient)
    val df: DataFrame = spark.createDataFrame(dataClient).toDF(columns. _*)
    

    line 1 declares dataClient as an Array[String]; line 3 then tries to redeclare dataClient as a Seq[Array[String]] with a single entry. You also declare column but then refer to it as columns.

    Finally, createDataFrame already creates a DataFrame, so the toDF call doesn't make sense here.

    It's not clear what you are trying to do with this code, even after those corrections, but I'd assume it's this:

    import org.apache.spark.sql.DataFrame
    import spark.implicits._

    // Simulating the program arguments:
    val args = Seq("a,b,c,d", "colname")

    val dataClient = args(0).split(",").map(_.trim).toSeq
    val column = args(1).trim

    val df: DataFrame = dataClient.toDF(column)
    df.show()
    

    yielding:

    +-------+
    |colname|
    +-------+
    |      a|
    |      b|
    |      c|
    |      d|
    +-------+
    

    The "import spark.implicits._" brings into scope all the implicit machinery needed to call DF and ".toSeq" allows that machinery to see a Seq[String] rather than an Array[String] to toDF can work.

    If you are trying to get multiple columns through, it's not so straightforward, as each "row" would be treated as an array if you use something like this:

    import org.apache.spark.sql.DataFrame
    import spark.implicits._

    val args = Seq("a,b, c, d | e,f ,g, h ", "col1, col2, col3 , col4")

    // Split rows on "|", then each row's fields on ",":
    val dataClient = args(0).split("\\|").map(_.trim).map(_.split(",").map(_.trim)).toSeq
    val columns = args(1).split(",").map(_.trim)

    val df: DataFrame = dataClient.toDF(columns: _*)
    df.show()
    

    which yields:

    java.lang.IllegalArgumentException: requirement failed: The number of columns doesn't match.
    Old column names (1): value
    New column names (4): col1, col2, col3, col4
    
        at scala.Predef$.require(Predef.scala:281)
    

    To make this work you would need to construct the rows directly:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.catalyst.encoders.RowEncoder
    import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    val args = Seq("a,b, c, d | e,f ,g, h ", "col1, col2, col3 , col4")

    val columns = args(1).split(",").map(_.trim)
    val schema = StructType(columns.map(c => StructField(c, StringType)))

    // Build one Row per "|"-separated group, attaching the schema directly:
    val dataClient = args(0).split("\\|").map(_.trim).map { raw =>
      val rawSeq = raw.split(",").map(_.trim)
      new GenericRowWithSchema(rawSeq.asInstanceOf[Array[Any]], schema): Row
    }.toSeq

    // Needed so createDataset knows how to encode Rows with this schema:
    implicit val rowEnc = RowEncoder(schema)

    val df = spark.createDataset(dataClient)
    df.show()
    

    resulting in:

    +----+----+----+----+
    |col1|col2|col3|col4|
    +----+----+----+----+
    |   a|   b|   c|   d|
    |   e|   f|   g|   h|
    +----+----+----+----+
    

    but this has a few things in it you should avoid:

    new GenericRowWithSchema(rawSeq.asInstanceOf[ Array[Any] ], schema) : Row
    

    whilst GenericRowWithSchema is not "internal" to Spark, it's definitely not intended for you to use Spark like this.
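
    For completeness, a sketch of the same result using only public API: build the rows with the Row(...) factory and hand them, together with the schema, to spark.createDataFrame, which accepts an RDD[Row] plus a StructType; no casts or explicit encoders needed:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    val columns = args(1).split(",").map(_.trim)
    val schema = StructType(columns.map(c => StructField(c, StringType)))

    // Row(...) is the public factory; each field becomes one column value:
    val rows = args(0).split("\\|").map(_.trim).map { raw =>
      Row(raw.split(",").map(_.trim): _*)
    }.toSeq

    val df = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)
    df.show()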

    If this is a learning experience, I'd suggest making a CSV file and reading that instead of trying to construct a DataFrame like this. If not, then I'd recommend a file exchange format like Parquet, which includes the schema and doesn't come with parsing concerns like actually wanting a "," in your field (or |, or whatever else you try to use as a separator). If it really must be text, then look at JSON, which would also allow some structure to be passed in (although not maps with non-string keys).
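
    As a sketch of that last suggestion (clients.csv and clients.parquet are hypothetical paths, just for illustration):

    // CSV: the header row supplies the column names, no manual splitting needed:
    val fromCsv = spark.read.option("header", "true").csv("clients.csv")

    // Parquet: the schema travels with the file, so reading it back needs no parsing at all:
    fromCsv.write.mode("overwrite").parquet("clients.parquet")
    val fromParquet = spark.read.parquet("clients.parquet")
    fromParquet.show()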