Tags: python, pandas, google-colaboratory, hvplot

HV Plot: Plotting multiple lines with null values


I have a DataFrame I am trying to graph using HV Plot.

So far, I have something like this:

new_df = new_df.dropna(subset=['Reflectance'])
new_df = new_df.sort_values(by='Wavelength')

reflectance_plot = new_df.hvplot.line(x = "Wavelength",y = "Reflectance", by="UniqueID", legend=False).opts(fontsize={'title': 16, 'labels': 14, 'yticks': 12},xrotation=45, xticks=15)
reflectance_plot

Which gives me something like this: [image]

As you can see, between the smooth areas with data, there are lots of straight lines where there are no values. I am trying to remove these straight lines so that only the data is plotted. I tried to do that with this code:

new_df['Reflectance'] = new_df['Reflectance'].fillna(np.nan).replace([np.nan], [None])
new_df = new_df.sort_values(by='Wavelength')
    
reflectance_plot = new_df.hvplot.line(x = "Wavelength",y = "Reflectance", by="UniqueID", legend=False).opts(fontsize={'title': 16, 'labels': 14, 'yticks': 12},xrotation=45, xticks=15)
reflectance_plot

Which leaves me with: [image]

The gaps now look the way I want, except the vast majority of the data is completely gone. I would appreciate any advice or insight into why this is happening and how to fix it.


Solution

  • I came across a similar issue, and what I came up with was the following:

    Generate & plot some problematic data:

    import pandas as pd
    import numpy as np
    import hvplot.pandas
    
    df = pd.DataFrame({'data1':np.random.randn(22),
                       'data2':np.random.randn(22)+3})
    
    df['time'] = pd.to_datetime('2022-12-25T09:00') + \
                 np.cumsum(([pd.Timedelta(1, unit='h')]*5 +
                           [pd.Timedelta(30, unit='h')] + # <-- big ol' gap in the data
                           [pd.Timedelta(1, unit='h')]*5)*2)
    
    df.set_index('time', inplace=True)
    df.hvplot()
    

    Which plots something like the following - where the gaps in the data are hopefully obvious (but won't always be): [image: plot with gaps]

    So the approach is to find gaps in your data which are unacceptably long. This will be context-specific. In the data above, good data is 1h apart and each gap is 30h, so I use a max acceptable gap of 2h:

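    # Insert an NA row just before the end of any gap which is unacceptably long:
    dt_max_acceptable = pd.Timedelta(2, unit='h')
    
    # Time elapsed since the previous row
    df['dt'] = df.index.to_series().diff()
    # Timestamps of the first row after each over-long gap
    t_at_end_of_gaps = df[df.dt > dt_max_acceptable].index.values
    # pd.Timedelta(1) is 1 nanosecond, so each new row lands just inside the gap
    t_before_end_of_gaps = [i - pd.Timedelta(1) for i in t_at_end_of_gaps]
    
    # Assigning to a new index label appends a row filled with NA
    for t in t_before_end_of_gaps:
        df.loc[t] = pd.NA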
        
    df.sort_index(inplace=True)
    df.hvplot()
    

    Which should plot something like this - showing that the line no longer spans the gaps which are 'too long': [image: plot where line no longer spans the gaps]

    The approach is quite easy to apply - and works for my purposes. The downside is that it adds artificial rows with NaN data in them - which might not always be acceptable.
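
For the data in the question, the same idea might be adapted to a numeric x axis with one line per group. The following is only a sketch under assumptions: the column names (`Wavelength`, `Reflectance`, `UniqueID`) are taken from the question, while the helper name `insert_gap_breaks`, the demo values and the threshold `max_gap=50` are made up for illustration. It works on a copy, so the artificial NaN rows never end up in the original DataFrame:

```python
import numpy as np
import pandas as pd


def insert_gap_breaks(df, x, y, by, max_gap):
    """Return a copy of df with one NaN row placed inside every gap in `x`
    wider than `max_gap`, evaluated separately for each group in `by`.
    (Hypothetical helper, not part of pandas or hvPlot.)"""
    out = df.dropna(subset=[y]).sort_values([by, x]).reset_index(drop=True)
    # Spacing between consecutive x values within each group
    step = out.groupby(by)[x].diff()
    # True for rows that sit at the far end of an over-long gap
    mask = step > max_gap
    # One filler row per gap: same group, x at the gap's midpoint, y = NaN
    breaks = pd.DataFrame({
        by: out.loc[mask, by].to_numpy(),
        x: (out.loc[mask, x] - step[mask] / 2).to_numpy(),
        y: np.nan,
    })
    return (pd.concat([out, breaks], ignore_index=True)
              .sort_values([by, x])
              .reset_index(drop=True))


# Made-up demo data: group "a" has a hole between 420 and 900, group "b" has none
demo = pd.DataFrame({
    "UniqueID": ["a"] * 6 + ["b"] * 4,
    "Wavelength": [400, 410, 420, 900, 910, 920, 400, 410, 420, 430],
    "Reflectance": [0.10, 0.12, 0.11, 0.30, 0.31, 0.33, 0.20, 0.21, 0.22, 0.20],
})

broken = insert_gap_breaks(demo, "Wavelength", "Reflectance", "UniqueID", max_gap=50)
```

With `import hvplot.pandas` loaded, `broken.hvplot.line(x="Wavelength", y="Reflectance", by="UniqueID")` should then draw each line with a break at the inserted NaN row, in the same way as the time-series example above. I have not tried this on the question's actual data, and the right `max_gap` depends on the normal spacing of the wavelength values.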