Tags: python, pandas, dataframe, pandas-groupby

Calculate Mean Absolute Error for each row of a Pandas dataframe


Below is a sample of the pandas dataframe I'm working with.
I want to calculate the mean absolute error (MAE) for each row, but considering only the columns relevant to that row's ID value.

There may be 2, 4, 6 or 8 columns relevant to a given ID value. For example, the relevant columns for 'id4' are 'id4_signal1_true' and 'id4_signal1_pred'; for 'id1' they are 'id1_signal1_true', 'id1_signal2_true', 'id1_signal1_pred' and 'id1_signal2_pred'.

import pandas as pd

# avoid naming this variable `list`, which shadows the built-in
data = [['id4',0.37,0.97,0.21,0.54,0.11,0.38,0.95,0.2,0.5,0.23],
        ['id1',0.41,0.44,0.21,0.54,0.11,0.41,0.48,0.2,0.5,0.23],
        ['id3',0.41,0.44,0.21,0.54,0.11,0.41,0.48,0.2,0.5,0.23]]

df = pd.DataFrame(data, columns=['ID','id1_signal1_true','id1_signal2_true','id4_signal1_true','id3_signal1_true',
                                 'id3_signal2_true','id1_signal1_pred','id1_signal2_pred','id4_signal1_pred',
                                 'id3_signal1_pred','id3_signal2_pred'])

I want to calculate the MAE for each row using only its relevant columns. For example, for the first row it should be mean(abs(id4_signal1_true - id4_signal1_pred)); for the second row it should be mean(abs(id1_signal1_true - id1_signal1_pred), abs(id1_signal2_true - id1_signal2_pred)).
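To make the definition concrete, here is a quick worked check using the sample values from the dataframe above (row 0 is 'id4' with one relevant pair, row 1 is 'id1' with two):

```python
# Worked check of the per-row MAE definition on the sample data.
row0 = abs(0.21 - 0.2)                             # id4_signal1_true vs id4_signal1_pred
row1 = (abs(0.41 - 0.41) + abs(0.44 - 0.48)) / 2   # mean over the two id1 signal pairs
print(round(row0, 2), round(row1, 2))              # -> 0.01 0.02
```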

Below is how the output should look (a screenshot was attached in the original question): MAE is the column I want to get.

I used the code below to solve this, and it works fine. The only issue is that I have around 2 million rows, so this takes hours. I want to find a more efficient way to do it. Your help is highly appreciated.

import numpy as np

Attack = df
ID_MAE = []
for i in range(len(Attack['ID'])):  # avoid `id` as loop variable; it shadows the built-in

    # select columns relevant to the current row's ID value
    signals = np.array(Attack[Attack.columns[Attack.columns.str.contains(Attack.ID[i])]])
    signal = signals[i]  # select only the current row

    no_of_signals = int(len(signal) / 2)  # number of signal pairs
    reshaped_arr = np.reshape(signal, (2, no_of_signals))
    signal_true = reshaped_arr[0]  # array of true values
    signal_pred = reshaped_arr[1]  # array of predicted values

    # MAE calculation
    MAE = np.mean(np.abs(signal_true - signal_pred))
    ID_MAE.append(MAE)

df['MAE'] = ID_MAE

Solution

  • import numpy as np

    df2 = df.set_index('ID').apply(
        lambda x: np.array(list(map(
            lambda t: t[1],
            sorted(x[(ser := pd.Series(x.index.to_list()))[ser.str.startswith(x.name)]]
                   .to_dict().items())
        ))).reshape(-1, 2),
        axis=1)
    df2.apply(lambda arr: np.array([abs(a[0] - a[1]) for a in arr]).mean())
    

    output:

    ID
    id4    0.01
    id1    0.02
    id3    0.08
    dtype: float64
    

    UPDATE:

    or you can:

    df2 = df.apply(
        lambda x: np.array(list(map(
            lambda t: t[1],
            sorted(x[(ser := pd.Series(x.index.to_list()))[ser.str.startswith(x.ID)]]
                   .to_dict().items())
        ))).reshape(-1, 2),
        axis=1)
    df["MAE"] = df2.apply(lambda arr: np.array([abs(a[0] - a[1]) for a in arr]).mean())
    

    well, well, well... now the explanation... let's go :)) Note: I explain the UPDATE part.

    What is the problem? You have some columns and rows, and for each row you want the corresponding columns, i.e. only the columns whose names start with the row's ID... Hmm, new idea:

    df.apply(lambda x: x.index.str.startswith(x.ID), axis=1)
    

    output:

    0    [False, False, False, True, False, False, Fals...
    1    [False, True, True, False, False, False, True,...
    2    [False, False, False, False, True, True, False...
    

    As you can see, this shows, for every row, whether each column corresponds to it (starts with its ID) or not (note: the ID values are id4, id1, ...).

    OK, next you must get all the corresponding columns, using:

    df.apply(lambda x: (ser:=pd.Series(x.index.to_list()))[ser.str.startswith(x.ID)], axis=1)
    

    output:

    1   2   3   4   5   6   7   8   9   10
    0   NaN NaN id4_signal1_true    NaN NaN NaN NaN id4_signal1_pred    NaN NaN
    1   id1_signal1_true    id1_signal2_true    NaN NaN NaN id1_signal1_pred    id1_signal2_pred    NaN NaN NaN
    2   NaN NaN NaN id3_signal1_true    id3_signal2_true    NaN NaN NaN id3_signal1_pred    id3_signal2_pred
    

    As you know, you can pass a list of booleans as an index to a pandas Series and get all the elements where the mask is True...
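A minimal standalone illustration of that boolean-indexing behavior (a toy Series, not the question's data):

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
mask = [True, False, True]   # one boolean per element
print(s[mask].tolist())      # -> [10, 30], only positions where the mask is True
```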

    Wait a minute, it can be simpler (because x is itself a Series, we can use x.index directly):

    df.apply(lambda x: x[x.index[x.index.str.startswith(x.ID)]], axis=1)
    

    OK, we got all the corresponding columns, so what? As you can see, some cells are NaN and we must get rid of them, so convert the data to a list of name-value pairs using to_dict().items():

    df.apply(lambda x: x[x.index[x.index.str.startswith(x.ID)]].to_dict().items(), axis=1)
    

    output:

    0    ((id4_signal1_true, 0.21), (id4_signal1_pred, ...
    1    ((id1_signal1_true, 0.41), (id1_signal2_true, ...
    2    ((id3_signal1_true, 0.54), (id3_signal2_true, ...
    dtype: object
    

    Why do we need the names? Because we need to calculate the MAE between the correct pairs...

    OK, now we have the pairs, but in the wrong order... how can we sort them? We know that matching pairs have the same name except for the last part: pred vs. true. So let's sort them by name:

    df.apply(lambda x: sorted(x[(ser:=pd.Series(x.index.to_list()))[ser.str.startswith(x.ID)]].to_dict().items()),  axis=1)
    

    output:

    0    [(id4_signal1_pred, 0.2), (id4_signal1_true, 0...
    1    [(id1_signal1_pred, 0.41), (id1_signal1_true, ...
    2    [(id3_signal1_pred, 0.5), (id3_signal1_true, 0...
    

    Oh yes, they are in the correct order now, and we can calculate the MAE for each pair. So we can get rid of the names: map over each list and take the second elements:

    df.apply(lambda x: list(map(lambda t: t[1], sorted(x[x.index[x.index.str.startswith(x.ID)]].to_dict().items()))),  axis=1)
    

    output:

    0                 [0.2, 0.21]
    1    [0.41, 0.41, 0.48, 0.44]
    2     [0.5, 0.54, 0.23, 0.11]
    dtype: object
    

    OK... now we can calculate the MAE for each pair, but how do we convert each flat list into a list of pairs? Hmmm... NumPy!!! Using .reshape(-1,2) we convert it into pairs:

    df2 = df.apply(lambda x: (np.array(list(map(lambda t: t[1], sorted(x[(ser:=pd.Series(x.index.to_list()))[ser.str.startswith(x.ID)]].to_dict().items())))).reshape(-1,2)), axis=1)
    

    output:

    0                   [[0.2, 0.21]]
    1    [[0.41, 0.41], [0.48, 0.44]]
    2     [[0.5, 0.54], [0.23, 0.11]]
    dtype: object
    

    Wait a minute... we are already using NumPy... so why not use it further?

    df.apply(lambda x: np.array(sorted(x[x.index[x.index.str.startswith(x.ID)]].to_dict().items()))[:,1].astype(float).reshape(-1,2),  axis=1)
    

    This converts the sorted output to a numpy.array and takes the second elements using [:,1]. Now, calculate the absolute difference for each pair:

    df2.apply(lambda arr: np.array([abs(a[0] - a[1]) for a in arr]))
    

    output:

    0                         [0.009999999999999981]
    1                     [0.0, 0.03999999999999998]
    2    [0.040000000000000036, 0.12000000000000001]
    dtype: object
    

    We calculated the absolute difference for each pair... and, again, we can make it simpler, using np.diff:

    df.apply(lambda x: np.abs(np.diff(np.array(sorted(x[x.index[x.index.str.startswith(x.ID)]].to_dict().items()))[:,1].astype(float).reshape(-1,2))), axis=1)
    

    And, finally, we calculate the mean of each numpy.array.

    The third, simplest and fastest manner, all in one step:

    df.apply(lambda x: np.abs(np.diff(np.array(sorted(x[x.index[x.index.str.startswith(x.ID)]].to_dict().items()))[:,1].astype(float).reshape(-1,2))).mean(), axis=1)
    

    I attempted to explain it in simple words; I hope it's helpful.
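Since the original question was about speed on ~2 million rows, here is a sketch of a different, mostly vectorized approach (not part of the answer above): loop over the unique ID values instead of the rows, assuming there are far fewer unique IDs than rows and that each ID's `_true`/`_pred` columns align when sorted by name.

```python
import numpy as np
import pandas as pd

# Same sample data as in the question.
data = [['id4', 0.37, 0.97, 0.21, 0.54, 0.11, 0.38, 0.95, 0.2, 0.5, 0.23],
        ['id1', 0.41, 0.44, 0.21, 0.54, 0.11, 0.41, 0.48, 0.2, 0.5, 0.23],
        ['id3', 0.41, 0.44, 0.21, 0.54, 0.11, 0.41, 0.48, 0.2, 0.5, 0.23]]
df = pd.DataFrame(data, columns=['ID', 'id1_signal1_true', 'id1_signal2_true',
                                 'id4_signal1_true', 'id3_signal1_true',
                                 'id3_signal2_true', 'id1_signal1_pred',
                                 'id1_signal2_pred', 'id4_signal1_pred',
                                 'id3_signal1_pred', 'id3_signal2_pred'])

mae = pd.Series(0.0, index=df.index)
for uid in df['ID'].unique():                # few unique IDs vs. millions of rows
    mask = df['ID'] == uid
    # sorting aligns *_true with *_pred, since names differ only in the suffix
    true_cols = sorted(c for c in df.columns if c.startswith(uid) and c.endswith('_true'))
    pred_cols = sorted(c for c in df.columns if c.startswith(uid) and c.endswith('_pred'))
    diff = df.loc[mask, true_cols].to_numpy() - df.loc[mask, pred_cols].to_numpy()
    mae[mask] = np.abs(diff).mean(axis=1)    # row-wise MAE for this ID's block
df['MAE'] = mae

print(df['MAE'].round(2).tolist())           # -> [0.01, 0.02, 0.08]
```

The inner arithmetic runs entirely in NumPy on whole blocks of rows, so the Python-level loop count drops from the number of rows to the number of distinct IDs.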