I'm trying to get my first Matplotlib violin plot going. I'm using the exact code from this SO post, but I'm getting a KeyError and have no idea what that means. Any ideas?
Process pandas dataframe into violinplot
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
x = np.random.poisson(lam =3, size=100)
y = np.random.choice(["S{}".format(i+1) for i in range(6)], size=len(x))
df = pd.DataFrame({"Scenario":y, "LMP":x})
fig, axes = plt.subplots()
axes.violinplot(dataset = [df[df.Scenario == 'S1']["LMP"],
df[df.Scenario == 'S2']["LMP"],
df[df.Scenario == 'S3']["LMP"],
df[df.Scenario == 'S4']["LMP"],
df[df.Scenario == 'S5']["LMP"],
df[df.Scenario == 'S6']["LMP"] ] )
Error:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-6-cd0789171d00> in <module>
15 df[df.Scenario == 'S4']["LMP"],
16 df[df.Scenario == 'S5']["LMP"],
---> 17 df[df.Scenario == 'S6']["LMP"] ] )
18
19 # axes.set_title('Day Ahead Market')
c:\Anaconda\lib\site-packages\matplotlib\__init__.py in inner(ax, data, *args, **kwargs)
1808 "the Matplotlib list!)" % (label_namer, func.__name__),
1809 RuntimeWarning, stacklevel=2)
-> 1810 return func(ax, *args, **kwargs)
1811
1812 inner.__doc__ = _add_data_doc(inner.__doc__,
c:\Anaconda\lib\site-packages\matplotlib\axes\_axes.py in violinplot(self, dataset, positions, vert, widths, showmeans, showextrema, showmedians, points, bw_method)
7915 return kde.evaluate(coords)
7916
-> 7917 vpstats = cbook.violin_stats(dataset, _kde_method, points=points)
7918 return self.violin(vpstats, positions=positions, vert=vert,
7919 widths=widths, showmeans=showmeans,
c:\Anaconda\lib\site-packages\matplotlib\cbook\__init__.py in violin_stats(X, method, points)
1460 # Evaluate the kernel density estimate
1461 coords = np.linspace(min_val, max_val, points)
-> 1462 stats['vals'] = method(x, coords)
1463 stats['coords'] = coords
1464
c:\Anaconda\lib\site-packages\matplotlib\axes\_axes.py in _kde_method(X, coords)
7910 def _kde_method(X, coords):
7911 # fallback gracefully if the vector contains only one value
-> 7912 if np.all(X[0] == X):
7913 return (X[0] == coords).astype(float)
7914 kde = mlab.GaussianKDE(X, bw_method)
c:\Anaconda\lib\site-packages\pandas\core\series.py in __getitem__(self, key)
765 key = com._apply_if_callable(key, self)
766 try:
--> 767 result = self.index.get_value(self, key)
768
769 if not is_scalar(result):
c:\Anaconda\lib\site-packages\pandas\core\indexes\base.py in get_value(self, series, key)
3116 try:
3117 return self._engine.get_value(s, k,
-> 3118 tz=getattr(series.dtype, 'tz', None))
3119 except KeyError as e1:
3120 if len(self) > 0 and self.inferred_type in ['integer', 'boolean']:
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_value()
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_value()
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
KeyError: 0
A KeyError is raised whenever looking up an item in a container fails. The values used in these lookups are keys, and the error means 0 is not a valid key for the pandas Series that was passed to the plot.
DataFrame and Series objects are not traditional NumPy arrays. They carry an index that provides fast lookups of data based on more or less arbitrary labels, including numbers, but also dates, strings, and more. This is in contrast to a standard ndarray, which allows only a linear index (i.e., a position) as a valid key. So when Matplotlib evaluates X[0] on one of your Series (see the _kde_method frame in the traceback), pandas tries to find the label 0 in that Series' index rather than retrieving the first item in the array.
If you inspect df[df.Scenario == 'S1']['LMP'].index, you should see something like this (the exact values depend on the random draw):
Int64Index([8, 20, 25, 27, 28, 35, 52, 57, 62, 68, 72, 74, 77, 80, 81, 83, 97], dtype='int64')
Note that 0 is nowhere to be found, hence the KeyError. Matplotlib was designed to work with NumPy ndarray objects, not pandas DataFrame or Series objects. It knows nothing about this label-based indexing, so errors of this type are common.
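To see the difference between label-based and positional lookup in isolation, here is a minimal sketch. The Series `s` and its index values are made up for illustration; they only mimic what boolean filtering leaves behind.

```python
import pandas as pd

# Hypothetical Series whose integer index does not contain the label 0,
# similar to what df[df.Scenario == 'S1']['LMP'] produces after filtering.
s = pd.Series([3, 1, 4], index=[8, 20, 25])

try:
    s[0]  # label-based lookup: there is no row labelled 0
    lookup_failed = False
except KeyError:
    lookup_failed = True

first_by_position = s.iloc[0]     # positional lookup on the Series
first_as_array = s.to_numpy()[0]  # plain ndarray, position-only indexing
```

Here `s[0]` fails for the same reason as in the traceback, while `.iloc[0]` and the converted array both return the first element.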
You have a few options to solve this. First, you can convert the data you'd like to plot to arrays. You can do this with df[df.Scenario == 'S1']['LMP'].values, repeated for each scenario.
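Applied to the code from the question, the conversion might look like the sketch below. The list comprehension, the seeded `RandomState`, the tick labels, and the `Agg` backend line (only there so the script runs without a display) are additions for illustration, not part of the original code.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Same data as in the question, seeded so the run is repeatable.
rng = np.random.RandomState(0)
x = rng.poisson(lam=3, size=100)
scenarios = ["S{}".format(i + 1) for i in range(6)]
y = rng.choice(scenarios, size=len(x))
df = pd.DataFrame({"Scenario": y, "LMP": x})

# .values strips the pandas index, leaving a plain ndarray per scenario.
dataset = [df[df.Scenario == s]["LMP"].values for s in scenarios]

fig, axes = plt.subplots()
parts = axes.violinplot(dataset=dataset)

# Violins are drawn at positions 1..6 by default; label them by scenario.
axes.set_xticks(range(1, len(scenarios) + 1))
axes.set_xticklabels(scenarios)
```

In pandas 0.24 and later, `.to_numpy()` can be used in place of `.values`.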
Another option is to use a package like seaborn, which is explicitly designed to work with pandas frames. I highly recommend seaborn in general; it's a very beautiful and well-designed package. It has its own version of violinplot, for example, which supports DataFrames and a whole host of options.
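A minimal sketch of the seaborn route, assuming seaborn is installed and reusing the data from the question. The seed, the `order` argument, and the `Agg` backend line are illustrative choices.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line in a notebook
import numpy as np
import pandas as pd
import seaborn as sns

# Same data as in the question, seeded so the run is repeatable.
rng = np.random.RandomState(0)
x = rng.poisson(lam=3, size=100)
scenarios = ["S{}".format(i + 1) for i in range(6)]
y = rng.choice(scenarios, size=len(x))
df = pd.DataFrame({"Scenario": y, "LMP": x})

# seaborn takes the frame and column names directly and does the grouping.
ax = sns.violinplot(x="Scenario", y="LMP", data=df, order=scenarios)
```

No manual filtering or index handling is needed here, since seaborn groups the rows by the Scenario column on its own.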