python apache-spark bokeh apache-zeppelin pandas-bokeh

How to build a values density heatmap in Bokeh for timed window occurencies calculated in Spark?

According to https://stackoverflow.com/a/48692943/1759063 it is possible to aggregate the occurence on values per time unit like that:

+---------+----------+------------------------+------------+------+
|device_id|read_date |ids                     |counts      |top_id|
+---------+----------+------------------------+------------+------+
|device_A |2017-08-05|[4041]                  |[3]         |4041  |
|device_A |2017-08-06|[4041, 4041]            |[3, 3]      |4041  |
|device_A |2017-08-07|[4041, 4041, 4041]      |[3, 3, 4]   |4041  |
|device_A |2017-08-08|[4041, 4041, 4041]      |[3, 4, 3]   |4041  |
|device_A |2017-08-09|[4041, 4041, 4041]      |[4, 3, 3]   |4041  |
|device_A |2017-08-10|[4041, 4041, 4041, 4045]|[3, 3, 1, 2]|4041  |
|device_A |2017-08-11|[4041, 4041, 4045, 4045]|[3, 1, 2, 3]|4045  |
|device_A |2017-08-12|[4041, 4045, 4045, 4045]|[1, 2, 3, 3]|4045  |
|device_A |2017-08-13|[4045, 4045, 4045]      |[3, 3, 3]   |4045  |
+---------+----------+------------------------+------------+------+

I'd like to plot that in Zeppelin with X being read_time, Y being integer ID value and counts turn it into heatmap. How I can plot that with Bokeh and pandas?

Solution

This kind of DataFrame is based on more plain DataFrame where ids and counts are not grouped to arrays. It is more convenient to use non grouped DataFrame to build that with Bokeh:

https://discourse.bokeh.org/t/cant-render-heatmap-data-for-apache-zeppelins-pyspark-dataframe/8844/8

instead of grouped to list columns ids/counts we have raw table with one line per unique id ('value') and value of count ('index') and each line has its 'write_time'

rowIDs = pdf['values']
colIDs = pdf['window_time']

A = pdf.pivot_table('index', 'values', 'window_time', fill_value=0)

source = ColumnDataSource(data={'x':[pd.to_datetime('Jan 24 2022')] #left most
                           ,'y':[0] #bottom most 
                           ,'dw':[pdf['window_time'].max()-pdf['window_time'].min()] #TOTAL width of image
                           #,'dh':[pdf['delayWindowEnd'].max()] #TOTAL height of image
                           ,'dh':[1000] #TOTAL height of image
                           ,'im':[A.to_numpy()] #2D array using to_numpy() method on pivotted df
                           })


color_mapper = LogColorMapper(palette="Viridis256", low=1, high=20)

plot = figure(toolbar_location=None,x_axis_type='datetime')
plot.image(x='x', y='y', source=source, image='im',dw='dw',dh='dh',  color_mapper=color_mapper)

color_bar = ColorBar(color_mapper=color_mapper, label_standoff=12)

plot.add_layout(color_bar, 'right')

#show(plot)
show(gridplot([plot], ncols=1, plot_width=1000, high=pdf['index'].max()))

And the result: