In holoviews, how do I sort categorical axes of an Overlay?


I have a pandas dataframe with three categorical columns (A, B, C) and one numeric column (N). I draw a scatter plot with A on the x-axis and N on the y-axis, stratified by B (let's make B binary for ease of reference). There is thus a dot for every A-C combination, colored by B (two colors). This results in an NdOverlay object.

Now I am trying to get the order of the x-axis right, so that the values of A are ordered by the sum of the absolute values of N for that category (irrespective of the stratum, i.e. B).

If I simply sort the entries of A in the dataframe based on a group sum, it works for most cases. However, in one case there is no A-C entry for a particular stratum of B, i.e. there is missing data. For example, a value of A does not exist for B=1 but does exist for B=0. When plotting this, the value gets added in the wrong place, because I am using an NdOverlay.

Is there a post plot process to change the factor ordering in a dimension?

import holoviews as hv
from holoviews import opts
import colorcet as cc
hv.extension("matplotlib")

# data: the dataframe described above, with columns A, B, C and N
ds = hv.Dataset(data, kdims=["A"], vdims=["N", "B"])
scatter = ds.to(hv.Scatter, "A", "N", "B").overlay().opts(opts.Scatter(color=hv.Cycle([cc.isolum[0], cc.isolum[-1]]), xrotation=90))

Example:

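import numpy as np
import pandas as pd
import hvplot.pandas  # registers the .hvplot accessor on dataframes

A = ['Sample_{}'.format(ii) for ii in range(20)]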
C = ['Category_{}'.format(ii) for ii in range(10)]
b_data = np.asarray([np.random.normal(0,xx+1,size=10) for xx in range(20)])

B_1 = pd.DataFrame(b_data,index=A,columns=C)
B_1 = B_1.rename_axis('A').reset_index().melt(id_vars='A',value_name='N',var_name='C')
B_1['B'] = 1

#create data set with one of the Sample_ entries removed.
b_data = np.asarray([np.random.normal(0,xx+1,size=10) for xx in range(19)])
B_0 = pd.DataFrame(b_data,index=A[:-1],columns=C)
B_0 = B_0.rename_axis('A').reset_index().melt(id_vars='A',value_name='N',var_name='C')
B_0['B'] = 0

myData = pd.concat([B_1,B_0])

featureOrder = myData.groupby('A')['N'].apply(lambda x: x.abs().sum()).sort_values(ascending=False).index
myData['A'] = pd.Categorical(myData.A, categories=featureOrder,ordered=True)
myData = myData.sort_values(by='A')

#generate plot using hvplot
myData.hvplot.scatter(x='A',y='N',by='B').opts(padding=0.1,xrotation=90)

#the following gives the same output, but doesn't use hvplot
from holoviews import opts
ds = hv.Dataset(myData, kdims=["A"], vdims=["N", "B"])
scatter = ds.to(hv.Scatter, "A", "N", "B").overlay().opts(opts.Scatter(color=hv.Cycle([cc.isolum[0], cc.isolum[-1]]), xrotation=90))
print(featureOrder)
Index(['Sample_17', 'Sample_18', 'Sample_13', 'Sample_16', 'Sample_11',
       'Sample_15', 'Sample_14', 'Sample_10', 'Sample_19', 'Sample_12',
       'Sample_9', 'Sample_6', 'Sample_8', 'Sample_7', 'Sample_5', 'Sample_4',
       'Sample_3', 'Sample_2', 'Sample_1', 'Sample_0'],
      dtype='object', name='A')

[Image: the resulting scatter plot]

In the plot, Sample_19 is appended at the end of the x-axis, although it should be 9th. If I swap the values of B, the plot comes out in the correct order.


Solution

  • Based on your example above, if you compare scatter[1] * scatter[0] with scatter[0] * scatter[1], you'll see that it's the first element of an Overlay that defines the order of categorical axes, and the rest is just appended (as you already found out).

    One workaround for now is inserting NaNs for all missing data that you'd like to be part of the ordering.

    (A more general approach to sorting categorical axes is the subject of several recent issues and will hopefully be implemented some day; see the GitHub issues linked in my comments.)
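The NaN workaround described above could be sketched in pandas roughly as follows. This is only an illustration under assumptions: `pad_missing_combinations` is a hypothetical helper (not part of pandas or holoviews), and the toy dataframe stands in for `myData` from the question. The idea is to add one NaN row for every (A, B) pair that is absent, so that each stratum of B contains every value of A, and only then apply the categorical ordering and sort.

```python
import itertools

import numpy as np
import pandas as pd


def pad_missing_combinations(df, x="A", by="B", value="N"):
    """Hypothetical helper: append one NaN row per (x, by) pair absent from df."""
    present = set(zip(df[x].tolist(), df[by].tolist()))
    all_pairs = itertools.product(df[x].unique().tolist(), df[by].unique().tolist())
    missing = [pair for pair in all_pairs if pair not in present]
    filler = pd.DataFrame(missing, columns=[x, by])
    filler[value] = np.nan  # placeholder value, meant to leave no visible dot
    return pd.concat([df, filler], ignore_index=True)


# Toy data (made up for illustration): "s2" has no entry for B == 0.
df = pd.DataFrame({
    "A": ["s0", "s1", "s2", "s0", "s1"],
    "B": [1, 1, 1, 0, 0],
    "N": [1.0, -5.0, 4.0, 2.0, 1.0],
})

# Desired x-axis order: descending sum of absolute values of N per A.
order = (
    df.groupby("A")["N"]
    .apply(lambda s: s.abs().sum())
    .sort_values(ascending=False)
    .index
)

# Pad first, then make A an ordered categorical and sort by it.
padded = pad_missing_combinations(df)
padded["A"] = pd.Categorical(padded["A"], categories=order, ordered=True)
padded = padded.sort_values("A")
```

After this step every value of B sees the values of A in the same order, so whichever element ends up first in the overlay should define the intended axis order. The padded frame would then be passed to the plotting call from the question in place of `myData`.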