python python-3.x regex seaborn python-itertools

RegEx expression to match alphanumeric ID and sum index

I have a directory of files named in the format [alphanumeric]_[integer].wav, for example:

gdg36dhd3d_0.wav
gdg36dhd3d_1.wav
gdg36dhd3d_2.wav
344fikuo4q_0.wav
344fikuo4q_1.wav

The alphanumeric part is the ID, and the integer after the underscore is the index. The following code should loop through a list of filenames, sum the index for each ID, and then plot a histplot of the data.

 import itertools
 import re
 import seaborn as sns

 # sum the number of isolated events for every original sample
 number_isolated_events = list()
 for k, g in itertools.groupby(isolated_filenames, key=lambda x: re.search(r'(\w+)_', x).group(1)):
     number_events = len(list(g))
     number_isolated_events.append(number_events)

 sns.histplot(number_isolated_events, kde=True, color='b')

However, when the above code is run with 20 items in isolated_filenames, the histplot looks like this:

The count is 1.0, which I assume means that only one filename is being matched by the regex '(\w+)_'.

The x-axis is also showing what seems like the total number of indexes combined. I would assume a maximum of no more than 6 indexes per ID. My guess is that all the indexes have been summed and associated with the first filename rather than their respective filenames somehow. Is the issue my RegEx expression or is there a better way to achieve what I'm looking for?

Solution

You can use

import itertools
isolated_filenames = ['gdg36dhd3d_0.wav','gdg36dhd3d_1.wav','gdg36dhd3d_2.wav','344fikuo4q_0.wav','344fikuo4q_1.wav']
l = [(x.rsplit('_')[0], x.rsplit('_')[-1][:-4]) for x in isolated_filenames]
number_isolated_events = []
for k, g in itertools.groupby(l, key=lambda x: x[0]):
    number_isolated_events.append((k, len(list(g))))  # if you need to count group items
    # number_isolated_events.append((k, sum(int(z[1]) for z in g)))  # if you need to sum numeric suffixes

print(number_isolated_events)

Output:

[('gdg36dhd3d', 3), ('344fikuo4q', 2)]
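
One caveat: itertools.groupby only merges items that sit next to each other, so the result above relies on the filenames already being ordered by ID. The sketch below illustrates this with a hypothetical, deliberately interleaved list (unsorted_filenames, file_id and count_per_id are names made up for the example, not part of the original code). It also assumes the ID is everything before the last underscore.

```python
import itertools

# Hypothetical listing: same IDs as in the question, but interleaved
unsorted_filenames = ['gdg36dhd3d_0.wav', '344fikuo4q_0.wav', 'gdg36dhd3d_1.wav',
                      '344fikuo4q_1.wav', 'gdg36dhd3d_2.wav']

def file_id(name):
    # Assumption: the ID is everything before the last underscore
    return name.rsplit('_', 1)[0]

def count_per_id(names):
    # One (ID, count) tuple per run of consecutive names sharing an ID
    return [(k, len(list(g))) for k, g in itertools.groupby(names, key=file_id)]

# Without sorting, every change of ID starts a new group
unsorted_counts = count_per_id(unsorted_filenames)

# Sorting by the same key first puts equal IDs next to each other
sorted_counts = count_per_id(sorted(unsorted_filenames, key=file_id))

print(unsorted_counts)
print(sorted_counts)
```

If the list comes from something like a directory listing whose order is not guaranteed, sorting with the same key before grouping is the safer choice.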

Notes:

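[(x.rsplit('_')[0], x.rsplit('_')[-1][:-4]) for x in isolated_filenames] builds a list of (ID, index) pairs: the part of each filename before the _ and the number between the _ and the .wav extension. The [:-4] slice assumes every file ends in .wav, and rsplit('_')[0] assumes the ID itself contains no underscore.
itertools.groupby(l, key=lambda x: x[0]) groups consecutive items by the file "root" name (the ID, without the numeric suffix), so files with the same ID must be adjacent in the list.
len(list(g)) counts the items in each group, i.e. the number of files per ID.
sum(int(z[1]) for z in g) sums the numeric suffixes of the files in each group.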
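
If the input order cannot be relied on, a dictionary-based tally may be simpler than groupby. The following is a minimal sketch of that alternative, not the approach used above: split_name is a hypothetical helper, and it assumes every name has the form ID_index.ext with an integer index after the last underscore.

```python
import os
from collections import Counter, defaultdict

isolated_filenames = ['gdg36dhd3d_0.wav', 'gdg36dhd3d_1.wav', 'gdg36dhd3d_2.wav',
                      '344fikuo4q_0.wav', '344fikuo4q_1.wav']

def split_name(filename):
    """Split 'ID_index.ext' into (ID, index). Hypothetical helper."""
    # Drop any directory part and the extension, whatever its length
    stem, _ext = os.path.splitext(os.path.basename(filename))
    # Split on the last underscore only, so IDs may contain underscores
    file_id, index = stem.rsplit('_', 1)
    return file_id, int(index)

# Number of files per ID; the order of the input does not matter
event_counts = Counter(split_name(f)[0] for f in isolated_filenames)

# Sum of the numeric suffixes per ID
index_sums = defaultdict(int)
for f in isolated_filenames:
    file_id, index = split_name(f)
    index_sums[file_id] += index

# Plain list of counts, the shape the question passes to sns.histplot
number_isolated_events = list(event_counts.values())

print(dict(event_counts))
print(dict(index_sums))
print(number_isolated_events)
```

The resulting number_isolated_events list can then be passed to sns.histplot exactly as in the question.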