I have an issue displaying a line plot from a PySpark DataFrame using Bokeh. The plot renders, but not the way I expected: some of the points are not connected year by year in sequence.
I already tried sorting the source DataFrame with orderBy:
# Join df_max, and df_avg to df_quake_freq
df_quake_freq = df_quake_freq.join(df_avg, ['Year']).join(df_max, ['Year'])
df_quake_freq = df_quake_freq.orderBy(asc('Year'))
df_quake_freq.show(5)
The output is: [image: dataframe source for line plot]
This is the code for plotting:
# Create a magnitude plot
def plotMagnitude():
    # Load the datasource
    cds = ColumnDataSource(data=dict(
        yrs = df_quake_freq['Year'].values.tolist(),
        avg_mag = df_quake_freq['Avg_Magnitude'].round(1).values.tolist(),
        max_mag = df_quake_freq['Max_Magnitude'].values.tolist()
    ))
    # Tooltip
    TOOLTIPS = [
        ('Year', ' @yrs'),
        ('Average Magnitude', ' @avg_mag'),
        ('Maximum Magnitude', ' @max_mag')
    ]
    # Create the figure
    mp = figure(title='Maximum and Average Magnitude by Year',
                plot_width=1150, plot_height=400,
                x_axis_label='Years',
                y_axis_label='Magnitude',
                x_minor_ticks=2,
                y_range=(5, df_quake_freq['Max_Magnitude'].max() + 1),
                toolbar_location=None,
                tooltips=TOOLTIPS)
    # Max Magnitude
    mp.line(x='yrs', y='max_mag', color='#cc0000', line_width=2, legend='Max Magnitude', source=cds)
    mp.circle(x='yrs', y='max_mag', color='#cc0000', size=8, fill_color='#cc0000', source=cds)
    # Average Magnitude
    mp.line(x='yrs', y='avg_mag', color='yellow', line_width=2, legend='Avg Magnitude', source=cds)
    mp.circle(x='yrs', y='avg_mag', color='yellow', size=8, fill_color='yellow', source=cds)
    mp = style(mp)
    show(mp)
    return mp

plotMagnitude()
The output is: [image: line plot]
The picture shows that some points are not connected in sequence; for example, the line runs 1965-1967-1966.
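For context, a Bokeh line glyph joins points in the order the rows appear in the data source, so a path like 1965-1967-1966 suggests the rows reaching the plot are not sorted by year. Below is a minimal sketch of reordering every column together before building the source, so that each year keeps its own magnitudes. The helper name `sort_columns_by` and the sample values are made up for illustration and are not part of the original code.

```python
def sort_columns_by(data, key):
    """Reorder every list in `data` by the values of data[key], keeping rows aligned.

    Hypothetical helper: `data` has the same dict-of-lists shape that is
    passed to ColumnDataSource(data=...).
    """
    # Row indices in ascending order of the key column
    order = sorted(range(len(data[key])), key=lambda i: data[key][i])
    # Apply the same row order to every column
    return {name: [values[i] for i in order] for name, values in data.items()}


# Made-up sample rows with the years out of order, as in the plot
data = {
    'yrs': [1965, 1967, 1966],
    'avg_mag': [6.0, 6.1, 6.2],
    'max_mag': [8.7, 7.2, 7.8],
}

sorted_data = sort_columns_by(data, 'yrs')
print(sorted_data)
```

The resulting dict can be passed to ColumnDataSource(data=sorted_data). Sorting all columns with one shared row order matters because sorting only the year list would pair each year with another year's magnitudes.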
My problem is solved. The dots were out of sequence because the plotMagnitude function relies on global variables, so to handle this I call sort_values() inside the function. This is the code:
def plotMagnitude():
    # Load the datasource
    cds = ColumnDataSource(data=dict(
        yrs = df_quake_freq['Year'].sort_values().values.tolist(),
        avg_mag = df_quake_freq['Avg_Magnitude'].round(1).values.tolist(),
        max_mag = df_quake_freq['Max_Magnitude'].values.tolist()