Tags: python, pandas, dataframe, numpy, standard-deviation

TypeError: incompatible index of inserted column with frame index when grouping 2 columns


I have a dataset that looks like this (+ some other cols):

Value         Theme       Country
-1.975767     Weather     China
-0.540979     Fruits      China
-2.359127     Fruits      China
-2.815604     Corona      Brazil
-0.929755     Weather     UK
-0.929755     Weather     UK

I want to find the standard deviation of the values after grouping by Theme and Country (as explained in "calculate standard deviation by grouping two columns"):

df = pd.read_csv('./Brazil.csv')
df['std'] = df.groupby(['themes', 'country'])['value'].std()

However, currently, I get this error:

File /usr/local/Cellar/ipython/8.0.1/libexec/lib/python3.10/site-packages/pandas/core/frame.py:3656, in DataFrame.__setitem__(self, key, value)
   3653     self._setitem_array([key], value)
   3654 else:
   3655     # set column
-> 3656     self._set_item(key, value)

File /usr/local/Cellar/ipython/8.0.1/libexec/lib/python3.10/site-packages/pandas/core/frame.py:3833, in DataFrame._set_item(self, key, value)
   3823 def _set_item(self, key, value) -> None:
   3824     """
   3825     Add series to DataFrame in specified column.
   3826 
   (...)
   3831     ensure homogeneity.
   3832     """
-> 3833     value = self._sanitize_column(value)
   3835     if (
   3836         key in self.columns
   3837         and value.ndim == 1
   3838         and not is_extension_array_dtype(value)
   3839     ):
   3840         # broadcast across multiple columns if necessary
   3841         if not self.columns.is_unique or isinstance(self.columns, MultiIndex):

File /usr/local/Cellar/ipython/8.0.1/libexec/lib/python3.10/site-packages/pandas/core/frame.py:4534, in DataFrame._sanitize_column(self, value)
   4532 # We should never get here with DataFrame value
   4533 if isinstance(value, Series):
-> 4534     return _reindex_for_setitem(value, self.index)
   4536 if is_list_like(value):
   4537     com.require_length_match(value, self.index)

File /usr/local/Cellar/ipython/8.0.1/libexec/lib/python3.10/site-packages/pandas/core/frame.py:10985, in _reindex_for_setitem(value, index)
  10981     if not value.index.is_unique:
  10982         # duplicate axis
  10983         raise err
> 10985     raise TypeError(
  10986         "incompatible index of inserted column with frame index"
  10987     ) from err
  10988 return reindexed_value

TypeError: incompatible index of inserted column with frame index

Solution

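  • You can use a per-group rolling window to compute a running (cumulative) standard deviation within each group. With a window as long as the whole frame and min_periods=1, the window behaves like an expanding one, so the last row of each group holds the standard deviation of the full group. The first row of each group is NaN because the sample standard deviation of a single value is undefined.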

    Code

    import pandas as pd
    
    # Create a sample dataframe
    import io
    text_csv = '''Value,Theme,Country
    -1.975767,Weather,China
    -0.540979,Fruits,China
    -2.359127,Fruits,China
    -2.815604,Corona,Brazil
    -0.929755,Weather,UK
    -0.929755,Weather,UK'''
    df = pd.read_csv(io.StringIO(text_csv))
    
    # Calculate cumulative standard deviations
    df_std = df.groupby(['Theme', 'Country'], as_index=False)['Value'].rolling(len(df), min_periods=1).std()
    
    # Merge the original df with the cumulative std values
    df_std = df.join(df_std.drop(['Theme', 'Country'], axis=1).rename(columns={'Value': 'CorrectedStd'}))
    

    Output

       Value      Theme    Country  CorrectedStd
    0  -1.97577   Weather  China    nan
    1  -0.540979  Fruits   China    nan
    2  -2.35913   Fruits   China    1.28562
    3  -2.8156    Corona   Brazil   nan
    4  -0.929755  Weather  UK       nan
    5  -0.929755  Weather  UK       0
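
As for why the original assignment fails: groupby(...).std() returns one value per group, indexed by the (Theme, Country) pairs, and pandas cannot align that MultiIndex with the frame's plain row index. If the goal is simply to give every row the standard deviation of its whole group (rather than a running value), one possible alternative is groupby(...).transform('std'), which returns a result aligned to the original rows. The sketch below assumes the column names from the sample data above; GroupStd is just an illustrative column name.

```python
import io

import pandas as pd

# Same sample data as in the answer above
text_csv = """Value,Theme,Country
-1.975767,Weather,China
-0.540979,Fruits,China
-2.359127,Fruits,China
-2.815604,Corona,Brazil
-0.929755,Weather,UK
-0.929755,Weather,UK"""
df = pd.read_csv(io.StringIO(text_csv))

# One value per group, indexed by (Theme, Country): this index does not
# match df's row index, which is what triggers the TypeError on assignment.
group_std = df.groupby(['Theme', 'Country'])['Value'].std()
print(group_std.index.names)

# transform('std') broadcasts each group's std back onto its rows,
# so the result shares df's index and can be assigned directly.
# 'GroupStd' is a hypothetical column name chosen for this example.
df['GroupStd'] = df.groupby(['Theme', 'Country'])['Value'].transform('std')
print(df)
```

Groups with a single row (Weather/China and Corona/Brazil here) still get NaN, since pandas computes the sample standard deviation (ddof=1) by default.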