Tags: python, dataframe, pyspark, bokeh, line-plot

Issues in displaying year by year line plot using bokeh (pyspark dataframe)


I have an issue displaying a line plot from a PySpark DataFrame using Bokeh. The plot renders, but not the way I expect: some of the points are not connected year by year in sequence.

Previously I tried sorting the source DataFrame using orderBy:

# Join df_max, and df_avg to df_quake_freq    
df_quake_freq = df_quake_freq.join(df_avg, ['Year']).join(df_max, ['Year'])    
df_quake_freq = df_quake_freq.orderBy(asc('Year'))    
df_quake_freq.show(5)

The output is: (image: dataframe source for line plot)

This is code for plotting:

# Create a magnitude plot
def plotMagnitude():
    # Load the datasource
    cds = ColumnDataSource(data=dict(
        yrs = df_quake_freq['Year'].values.tolist(),
        avg_mag = df_quake_freq['Avg_Magnitude'].round(1).values.tolist(),
        max_mag = df_quake_freq['Max_Magnitude'].values.tolist()
    ))
    
    # Tooltip
    TOOLTIPS = [
        ('Year', ' @yrs'),
        ('Average Magnitude', ' @avg_mag'),
        ('Maximum Magnitude', ' @max_mag')
    ]
    
    # Create the figure
    mp = figure(title='Maximum and Average Magnitude by Year',
               plot_width=1150, plot_height=400,
               x_axis_label='Years',
               y_axis_label='Magnitude',
               x_minor_ticks=2,
               y_range=(5, df_quake_freq['Max_Magnitude'].max() + 1),
               toolbar_location=None,
               tooltips=TOOLTIPS)
    
    # Max Magnitude
    mp.line(x='yrs', y='max_mag', color='#cc0000', line_width=2, legend='Max Magnitude', source=cds)
    mp.circle(x='yrs', y='max_mag', color='#cc0000', size=8, fill_color='#cc0000', source=cds)
    
    # Average Magnitude 
    mp.line(x='yrs', y='avg_mag', color='yellow', line_width=2, legend='Avg Magnitude', source=cds)
    mp.circle(x='yrs', y='avg_mag', color='yellow', size=8, fill_color='yellow', source=cds)
    
    mp = style(mp)
    
    show(mp)
    
    return mp

plotMagnitude()

The output is: (image: line plot)

As the picture shows, some dots are not connected in sequence; for example, the line runs 1965-1967-1966.
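
A Bokeh line glyph joins points in the order they appear in the data source, not in x order, which would explain a 1965-1967-1966 path. Below is a minimal plain-Python sketch of reordering all columns together so that each magnitude stays paired with its year. The values are made up and the helper name sort_columns_together is hypothetical; neither comes from the original post.

```python
def sort_columns_together(years, *columns):
    """Reorder every column by ascending year, keeping rows aligned."""
    # Compute the row order once from the year column
    order = sorted(range(len(years)), key=lambda i: years[i])
    sorted_years = [years[i] for i in order]
    # Apply that same order to every other column
    sorted_columns = [[col[i] for i in order] for col in columns]
    return sorted_years, sorted_columns

# Made-up values for illustration only (not the asker's data)
years = [1965, 1967, 1966]
max_mag = [8.7, 7.4, 7.8]
avg_mag = [6.0, 6.1, 5.9]

yrs, (max_sorted, avg_sorted) = sort_columns_together(years, max_mag, avg_mag)
print(yrs)         # [1965, 1966, 1967]
print(max_sorted)  # [8.7, 7.8, 7.4]
print(avg_sorted)  # [6.0, 5.9, 6.1]
```

With a pandas DataFrame, the same effect should come from sorting the whole frame, for example df.sort_values('Year'), before building the ColumnDataSource. Sorting only the year column would leave the magnitudes in their old order.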


Solution

  • My problem is solved. The dots were not in sequence because the plotMagnitude function relies on global variables, so I handled it by calling sort_values() inside the function. This is the syntax:

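def plotMagnitude():
    # Sort the whole frame by Year so each magnitude stays paired with its year
    # (sorting only the Year column would misalign the other columns)
    df_sorted = df_quake_freq.sort_values('Year')
    # Load the datasource
    cds = ColumnDataSource(data=dict(
        yrs = df_sorted['Year'].values.tolist(),
        avg_mag = df_sorted['Avg_Magnitude'].round(1).values.tolist(),
        max_mag = df_sorted['Max_Magnitude'].values.tolist()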