Search code examples
pythonpandasdataframeheatmapmissing-data

Visualise missing values in a time series heatmap


I am really new in big data analysing. Let's say I have a big data with the following features. I want to visualise the the percentage of missing values (None values) of fuel parameters for every id in specific hour. I want to draw a chart that x-axis is the time series (time column), y-axis is the 'id' and the colour will indicate its missing fuel percentage. I grouped the data base on 'id' and 'hour'

I don't know how to visualise missing value in a good way for all ids. For example if the percentage of missing value fuel of specific id in specific hour is 100% then the colour in that specific time and for that 'id' can be gray. If percentage of missing value in fuel is 50%, the colour can be light green. If percentage of missing value in fuel is 0% then the colour can be dark green. The colour must be based to the percentage of missing value in fuel, after grouping based on id and time.

    id    time                   fuel
0   1     2022-02-26 19:08:33    100
2   1     2022-02-26 20:09:35    None
3   2     2022-02-26 21:09:35    70
4   3     2022-02-26 21:10:55    60
5   4     2022-02-26 21:10:55    None
6   5     2022-02-26 22:12:43    50
7   6     2022-02-26 23:10:50    None

So for example, in the following code I computed the percentage of the missing value for every hour for specific id:

df.set_index('ts').groupby(['id', pd.Grouper(freq='H')])['fuell'].apply(lambda x: x.isnull().mean() * 100)

Is there any solution?


Solution

  • Update: The heatmap now plots id vs time vs percentage of null fuel. I've kept my original answer for id vs time vs fuel at the end of this post.


    I want something almost like a github style calendar.

    To mimic the GitHub contribution matrix, reset the grouped null percentages into a dataframe and pivot into 1 id per row and 1 hour per column. Then use sns.heatmap to color each cell based on percentage of null fuel.

    # convert to proper dtypes
    df['time'] = pd.to_datetime(df['time'])
    df['fuel'] = pd.to_numeric(df['fuel'], errors='coerce')
    
    # compute null percentage per (id, hour)
    nulls = (df.set_index('time')
               .groupby(['id', pd.Grouper(freq='H')])['fuel']
               .apply(lambda x: x.isnull().mean() * 100))
    
    # pivot into id vs time matrix
    matrix = (nulls.reset_index(name='null (%)')
                   .pivot(index='id', columns='time', values='null (%)'))
    
    # plot time series heatmap
    sns.heatmap(matrix, square=True, vmin=0, vmax=100, cmap='magma_r', cbar_kws={'label': 'null (%)'},
                linewidth=1, linecolor='lightgray', clip_on=False,
                xticklabels=matrix.columns.strftime('%b %d, %Y\n%H:%M:%S'))
    


    Original: This is for visualizing id by time by fuel:

    1. Pivot into an id vs time matrix. Normally pivot is fine, but since your real data contains duplicate indexes, use pivot_table.
    2. resample the time columns into hourly means.
    3. Plot the time series matrix using sns.heatmap.
    # convert to proper dtypes
    df['time'] = pd.to_datetime(df['time'])
    df['fuel'] = pd.to_numeric(df['fuel'], errors='coerce')
    
    # pivot into id vs time matrix
    matrix = df.pivot_table(index='id', columns='time', values='fuel', dropna=False)
    
    # resample columns into hourly means
    matrix = matrix.resample('H', axis=1).mean()
    
    # plot time series heatmap
    sns.heatmap(matrix, square=True, cmap='plasma_r', vmin=0, vmax=100, cbar_kws={'label': 'fuel (%)'},
                linewidth=1, linecolor='lightgray', clip_on=False,
                xticklabels=matrix.columns.strftime('%b %d, %Y\n%H:%M:%S'))