Search code examples
pythonpandas

Split strings in a Series, convert to array and average the values


I have a Pandas Series that has these unique values:

array(['17', '19', '21', '20', '22', '23', '12', '13', '15', '24', '25',
       '18', '16', '14', '26', '11', '10', '12/16', '27', '10/14',
       '16/22', '16/21', '13/17', '14/19', '11/15', '10/15', '15/21',
       '13/19', '13/18', '32', '28', '12/15', '29', '42', '30', '31',
       '34', '46', '11/14', '18/25', '19/26', '17/24', '19/24', '17/23',
       '13/16', '11/16', '15/20', '36', '17/25', '19/25', '17/22',
       '18/26', '39', '41', '35', '50', '9/13', '33', '10/13', '9/12',
       '93/37', '14/20', '10/16', '14/18', '16/23', '37', '9/11', '37/94',
       '20/54', '22/31', '22/30', '23/33', '44', '40', '50/95', '38',
       '16/24', '15/23', '15/22', '18/23', '16/20', '37/98', '19/27',
       '38/88', '23/31', '14/22', '45', '39/117', '28/76', '33/82',
       '15/19', '23/30', '47', '46/115', '14/21', '17/18', '25/50',
       '12/18', '12/17', '21/28', '20/27', '26/58', '22/67', '22/47',
       '25/51', '35/83', '39/86', '31/72', '24/56', '30/80', '32/85',
       '42/106', '40/99', '30/51', '21/43', '52', '56', '25/53', '34/83',
       '30/71', '27/64', '35/111', '26/62', '32/84', '39/95', '18/24',
       '22/29', '42/97', '48', '55', '58', '39/99', '49', '43', '40/103',
       '22/46', '54/133', '25/54', '36/83', '29/72', '28/67', '35/109',
       '25/62', '14/17', '42/110', '52/119', '20/60', '46/105', '25/56',
       '27/65', '25/74', '21/49', '29/71', '26/59', '27/62'], dtype=object)

The ones that have the '/', I want to split these into arrays and then average their values. One simpler but a flawed approach is to simply extract the first value: master_data["Cmb MPG"].str.split('/').str[0].astype('int8')

However, what I truly require is the two values being averaged. I have tried several commands and this one:

np.array(master_data["Cmb MPG"].str.split('/')).astype('int8').mean()

Should ideally do the job, but I get a ValueError followed by a TypeError:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
TypeError: int() argument must be a string, a bytes-like object or a real number, not 'list'

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
Cell In[88], line 1
----> 1 np.array(master_data["Cmb MPG"].str.split('/')).astype('int8')

ValueError: setting an array element with a sequence.

The slice() method returns a Series but it won't proceed either with the splitting of strings.

What is required is:

'18/25' ---> [18, 25] ---> 22 (rounded)

Solution

  • I would use extractall and groupby.mean:

    s = pd.Series(['10', '12/16', '27', '10/14', '16/22', '16/21', '13/17'])
    
    out = (s.str.extractall(r'(\d+)')[0].astype(int).groupby(level=0).mean()
            .round().astype(int)
           )
    

    You could also go with split and mean, but this generates a more expensive intermediate and will not scale as well if you have many items (1/2/3/4/5):

    out = (s.str.split('/', expand=True).astype(float).mean(axis=1)
            .round().astype(int)
          )
    

    Output:

    0    10
    1    14
    2    27
    3    12
    4    19
    5    18
    6    15
    dtype: int64