I am using Apache Zeppelin (0.9.0) and Scala (2.11.12). I want to pull some data out of a DataFrame and store it in InfluxDB, to be visualized later in Grafana, but I cannot figure it out. I'm trying a naive approach with a foreach
loop: iterate over all rows, extract the columns I need, create a Point object (from this InfluxDB client library), and either send it to InfluxDB right away or add it to a list and send all the points in bulk after the loop.
The DataFrame looks like this:
+---------+---------+-------------+-----+
|timestamp|sessionId|eventDuration|speed|
+---------+---------+-------------+-----+
| 1| ses1| 0.0| 50|
| 2| ses1| 1.0| 50|
| 3| ses1| 2.0| 50|
I've tried to do what is described above:
import scala.collection.mutable.ListBuffer
import spark.implicits._
import org.apache.spark.sql._
import com.paulgoldbaum.influxdbclient._
import scala.concurrent.ExecutionContext.Implicits.global
val influxdb = InfluxDB.connect("172.17.0.4", 8086)
val database = influxdb.selectDatabase("test")
var influxData = new ListBuffer[Point]()
dfAnalyseReport.foreach(row => {
  val point = Point("acceleration")
    .addTag("speedBin", row.getLong(3).toString)
    .addField("eventDuration", row.getDouble(2))
  influxData += point
})
val influxDataList = influxData.toList
database.bulkWrite(influxDataList)
The only thing I am getting here is a cryptic java.lang.ClassCastException
with no additional detail, either in the notebook output or in the logs of the Zeppelin Docker container. The error seems to come from the foreach itself, as it appears even when I comment out the last two lines.
I also tried adapting approach 1 from this answer, using a case class for the columns, but to no avail. I got it to run without an error, but the resulting list was empty. Unfortunately I deleted that attempt. I could reconstruct it if necessary, but I've spent so much time on this that I'm fairly certain I have some fundamental misunderstanding of how this should be done.
One further question: I also tried writing each Point to the database as it was constructed, instead of in bulk. The only difference is that instead of appending to the ListBuffer I called database.write(point). When done outside of the loop with a dummy point, it goes through without a problem - the data ends up in InfluxDB - but inside the loop it results in org.apache.spark.SparkException: Task not serializable.
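That variant looked roughly like this (same setup as above, only writing directly instead of buffering):

dfAnalyseReport.foreach(row => {
  val point = Point("acceleration")
    .addTag("speedBin", row.getLong(3).toString)
    .addField("eventDuration", row.getDouble(2))
  // calling the database from inside the Spark closure is what triggers Task not serializable
  database.write(point)
})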
Could someone point me in the right direction? How should I tackle this?
I'd do it with the RDD map method and collect the results into a list:
val influxDataList = dfAnalyseReport.rdd.map(row =>
  Point("acceleration")
    .addTag("speedBin", row.getInt(3).toString)
    .addField("eventDuration", row.getDouble(2))
).collect.toList
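Since everything is collected back to the driver, the InfluxDB connection never has to be serialized into a Spark closure, which is what was causing the Task not serializable error. From there you can reuse the database handle from the question to push the points in bulk; a minimal sketch, assuming you want to block until the asynchronous write completes:

import scala.concurrent.Await
import scala.concurrent.duration._

// bulkWrite returns a Future; wait for it so the notebook paragraph
// doesn't finish before the points have actually been sent
// (the 30-second timeout is an arbitrary choice)
Await.result(database.bulkWrite(influxDataList), 30.seconds)

Note that collect pulls the whole result set into driver memory, so this is only appropriate if the report DataFrame is reasonably small.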