Tags: scala, apache-spark, twitter, spark-streaming, apache-spark-sql

Data lost when converting DStream to DataFrame


I have an issue when I try to convert my DStream[String] into DataFrames.

My goal is to transform the RDDs of the Twitter stream into DataFrames, but with my code (below) the transformation doesn't work: I end up with a DataFrame that contains only one word.

For example, for the tweet "hi every body", my DataFrame contains only the word "hi".

Here is the relevant piece of code:

    val splited_test = texts.transform(rdd => rdd.map(x => Row.fromSeq(x.split(" "))))

    splited_test.foreachRDD { rdd =>
      val fields = new Array[StructField](1)
      fields(0) = DataTypes.createStructField("text", StringType, true)
      val schema = DataTypes.createStructType(fields)
      val df = sqlContext.createDataFrame(rdd, schema)
    }

Solution

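  • Only the first word is stored because x.split(" ") turns each tweet into a Row with one field per word, while your schema declares a single field ("text"). Only the first field of each Row, that is the first word, lines up with the schema.

    To keep the whole tweet, wrap the string in a single-field Row. Modify the code as follows:

    val splited_test = texts.transform(rdd => rdd.map(x => Row.fromSeq(Seq(x))))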
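
For reference, here is a minimal sketch of the fix outside the streaming job, so the Row/schema relationship can be checked on its own. It assumes Spark 2.x or later (SparkSession instead of the sqlContext used in the question) and a local master. The object name TweetFrames and the helpers wholeText and perWord are hypothetical, and the parallelized batch is only a stand-in for one micro-batch of the DStream. The perWord variant is an alternative in case the intent was one row per word rather than one row per tweet.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical names for illustration; none of them come from the original post.
object TweetFrames {
  // One nullable string column, matching the schema built in the question.
  val schema: StructType =
    StructType(Seq(StructField("text", StringType, nullable = true)))

  // One row per tweet: the whole string goes into the single "text" field.
  def wholeText(spark: SparkSession, rdd: RDD[String]): DataFrame =
    spark.createDataFrame(rdd.map(t => Row(t)), schema)

  // Alternative: one row per word, still exactly one field per Row.
  def perWord(spark: SparkSession, rdd: RDD[String]): DataFrame =
    spark.createDataFrame(rdd.flatMap(_.split(" ")).map(w => Row(w)), schema)

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, just to try the helpers.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("tweet-frames")
      .getOrCreate()

    // Stand-in for one micro-batch of the DStream[String].
    val batch = spark.sparkContext.parallelize(Seq("hi every body"))

    wholeText(spark, batch).show(false)
    perWord(spark, batch).show(false)

    spark.stop()
  }
}
```

Inside the streaming job, the same helper could be called from texts.foreachRDD { rdd => ... } on each batch. In both variants every Row has exactly as many fields as the schema has columns, which is the property the original split-based code was missing.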