Search code examples
pythoncsvpandasmulti-index

Length mismatch error when scaling up multiIndex slicer on large dataset


I am trying to split imported csv files (timeseries) and manipulated them, using pandas MultiIndex Slicer command .xs(). The following df replicates the structure of my imported csv file.

import pandas as pd

df = pd.DataFrame(
    {'Sensor ID': [14,1,3,14,3],
    'Building ID': [109,109,109,109,109],
    'Date/Time': ["26/10/2016 14:31:14","26/10/2016 14:31:16", "26/10/2016 14:32:17", "26/10/2016 14:35:14", "26/10/2016 14:35:38"],
    'Reading': [20.95, 20.62, 22.45, 20.65, 22.83],
    })

df.set_index(['Sensor ID','Date/Time'], inplace=True)
df.sort_index(inplace=True)
print(df)

SensorList = [1, 3, 14]

for s in SensorList:
    df1 = df.xs(s, level='Sensor ID')

I have tested the code on a small excerpt of csv data and it works fine. However, when implementing with the entire csv file, I receive the error: ValueError: Length mismatch: Expected axis has 19562 elements, new values have 16874 elements.

Printing df.info() returns the following:

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 65981 entries, (1, 2016-10-26 14:35:15) to (19, 2016-11-07 11:27:14)
Data columns (total 2 columns):
Building ID    65981 non-null int64
Reading        65981 non-null float64
dtypes: float64(1), int64(1)
memory usage: 1.5+ MB
None

Any tip on what may be causing the error?

EDIT

I inadvertently truncated my code, thus leaving it pointless in its current form. The original code resamples values into 15-minutes and 1-hour intervals.

with:

units = ['D1','D3','D6','D10']
unit_output_path = './' + unit + '/'

the loop does:

for s in SensorList:

    ## Slice multi-index to isolate all readings for sensor s
    df1 = df_mi.xs(s, level='Sensor ID')
    df1.drop('Building ID', axis=1, inplace=True)

    ## Resample by 15min and 1hr intervals and exports individual csv files
    df1_15min = df1.resample('15Min').mean().round(1)
    df1_hr = df1.resample('60Min').mean().round(1)

Traceback:

File "D:\AN6478\AN6478_POE_ABo.py", line 52, in <module>
    df1 = df_mi.xs(s, level='Sensor ID')
  File "C:\Program Files\Anaconda3\lib\site-packages\pandas\core\generic.py", line 1736, in xs
    setattr(result, result._get_axis_name(axis), new_ax)
  File "C:\Program Files\Anaconda3\lib\site-packages\pandas\core\generic.py", line 2685, in __setattr__
    return object.__setattr__(self, name, value)
  File "pandas\src\properties.pyx", line 65, in pandas.lib.AxisProperty.__set__ (pandas\lib.c:44748)
  File "C:\Program Files\Anaconda3\lib\site-packages\pandas\core\generic.py", line 428, in _set_axis
    self._data.set_axis(axis, labels)
  File "C:\Program Files\Anaconda3\lib\site-packages\pandas\core\internals.py", line 2635, in set_axis
    (old_len, new_len))
ValueError: Length mismatch: Expected axis has 19562 elements, new values have 16874 elements

Solution

  • I can't tell you why exactly df1 = df_mi.xs(s, level='Sensor ID') raises the ValueError here. Where does df_mi come from?

    Here is an alternative using groupby which accomplishes what you want on your given dummy data frame without relying on multiIndex and xs. :

    # reset index to have DatetimeIndex, otherwise resample won't work
    df = df.reset_index(0)
    df.index = pd.to_datetime(df.index)
    
    # create data frame for each sensor, keep relevant "Reading" column
    grouped = df.groupby("Sensor ID")["Reading"]
    
    # iterate each sensor data frame
    for sensor, sub_df in grouped:
        quarterly = sub_df.resample('15Min').mean().round(1)
        hourly = sub_df.resample('60Min').mean().round(1)
    
        # implement your to_csv saving here
    

    Note, you could also use the groupby on the multiIndex with df.groupby(level="Sensor ID"), however since you want to resample later on, it is easier to drop Sensor ID from the multiIndex which simplifies it overall.