How can I get the start and end indices of a note in a volume graph?

I am trying to make a program that tells me when a note has been pressed.

I have the following notes exported as a .wav file (The C Major Scale 4 times with different rhythms, dynamics and in different octaves):

I can get the volumes of my sound file using the following code:

from scipy.io import wavfile

def get_volume(file):
    sr, data = wavfile.read(file)

    if data.ndim > 1:
        data = data[:, 0]

    return data

volumes = get_volume("FILE")

Here is some information about the output:

Max: 27851
Min: -25664
Mean: -0.7569383391943734
A Sample from the array: [ -7987  -8615  -8983  -9107  -9019  -8750  -8324  -7752  -7033  -6156
  -5115  -3920  -2610  -1245    106   1377   2520   3515   4364   5077
   5659   6113   6441   6639   6708   6662   6518   6288   5962   5525
   4963   4265   3420   2418   1264    -27  -1429  -2901  -4388  -5814
  -7101  -8186  -9028  -9614  -9955 -10077 -10012  -9785  -9401  -8846]

And here is what I get when I plot the volumes array (x is the index, y is the volume):

I want to get the indices of the start and end of each note, like the ones marked in the image (I did it by hand, so it is not accurate):

When I looked at the data, I realized that it is a 1D array. I also noticed that when a note gets louder or quieter, the change is not smooth: the values zigzag, but there is still a trend. So I can't just take the gradient (slope) at each point. I thought about grouping the samples into batches, getting the average gradient of each batch, and doing the calculations with that, like so:

def get_average_gradient(arr):
    # Calculates average gradient
    return sum([i - (sum(arr) / len(arr)) for i in arr]) / len(arr)


def get_note_start_end(arr_size, batch_size, arr):
    # Finds start and end indices
    ranges = []
    curr_range = [0]

    prev_slope = curr_slope = "NO SLOPE"
    has_ended = False

    for i, j in enumerate(arr):
        if j > 0:
            curr_slope = "INCREASING"
        elif j < 0:
            curr_slope = "DECREASING"
        else:
            curr_slope = "NO SLOPE"

        if prev_slope == "DECREASING" and not has_ended:
            if i == len(arr) - 1 or arr[i + 1] < 0:
                if curr_slope != "DECREASING":
                    curr_range.append((i + 1) * batch_size + batch_size)
                    ranges.append(curr_range)
                    curr_range = [(i + 1) * batch_size + batch_size + 1]
                    has_ended = True

        if has_ended and curr_slope == "INCREASING":
            has_ended = False

        prev_slope = curr_slope

    ranges[-1][-1] = arr_size - 1

    return ranges


def get_notes(batch_size, arr):
    # Gets the gradients of the batches
    out = []

    for i in range(0, len(arr), batch_size):
        if i + batch_size > len(arr):
            gradient = get_average_gradient(arr[i:])
        else:
            gradient = get_average_gradient(arr[i: i+batch_size])

        # print(gradient, i)
        out.append(gradient)

    return get_note_start_end(len(arr), batch_size, out)

notes = get_notes(128, volumes)

The problem with this is that if the batch size is too small, it returns the indices of small peaks that aren't notes on their own. If the batch size is too big, the program misses the start and end indices.

I also tried to get the notes, by using the silence. Here is the code I used:

from pydub import AudioSegment, silence

audio = intro = AudioSegment.from_wav("C - Major - Test.wav")
dBFS = audio.dBFS

notes = silence.detect_nonsilent(audio, min_silence_len=50, silence_thresh=dBFS-10)

This worked the best, but it still wasn't good enough. Here is what I got:

It identified some notes pretty well, but it couldn't identify notes accurately when a note didn't become very quiet before the next one was played (as in the second and fourth scales).

I have been thinking about this problem for days and I have tried most, if not all, of the good(?) ideas I had. I am new to analysing audio files. Maybe I am using the wrong data for what I want to do. Maybe I need to use the frequency data (I tried getting it, but couldn't make sense of it). Frequency code:

from scipy.fft import rfft, rfftfreq
from scipy.io import wavfile
import matplotlib.pyplot as plt


def get_freq(file, start_time, end_time):
    sr, data = wavfile.read(file)

    if data.ndim > 1:
        data = data[:, 0]

    # Fourier Transform
    N = len(data)
    yf = rfft(data)
    xf = rfftfreq(N, 1 / sr)

    return xf, yf


FILE = "C - Major - Test.wav"

plt.plot(*get_freq(FILE, 0, 10))
plt.show()

And the frequency graph:

And here is the .wav file: https://drive.google.com/file/d/1CERH-eovu20uhGoV1_O3B2Ph-4-uXpiP/view?usp=sharing

Any help is appreciated :)

Solution

I think this is what you need: first take the absolute value of the samples so the negative numbers become positive, then smooth the curve to remove the noise. To find the lower peaks (the quiet points between notes), run the same peak search on the negated curve.

from scipy.io import wavfile
import matplotlib.pyplot as plt
from scipy.signal import find_peaks
import numpy as np
from scipy.signal import savgol_filter

def get_volume(file):
    sr, data = wavfile.read(file)
    if data.ndim > 1:
        data = data[:, 0]
    return data

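# Rectify the signal; cast to float first so abs() cannot overflow int16
v1 = np.abs(get_volume("test.wav").astype(float))
# Smooth the curve (older SciPy versions require an odd window length)
volumes = savgol_filter(v1, 10001, 3)
lv = -volumes
# Find the loud peaks and, on the negated curve, the quiet troughs
peaks, _ = find_peaks(volumes, distance=8000, prominence=300)
lpeaks, _ = find_peaks(lv, distance=8000, prominence=300)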
# plot them
plt.plot(volumes)
plt.plot(peaks,volumes[peaks],"x")
plt.plot(lpeaks,volumes[lpeaks],"o")
plt.plot(np.zeros_like(volumes), "--", color="gray")
plt.show()

Plot with your test file: x marks the high peaks and o marks the lower peaks.
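Since the question asks for start and end indices rather than peaks, here is a rough sketch of one way to turn the lower peaks into note ranges. It assumes that each note is one loud peak and that a note runs from the trough before that peak to the trough after it. The function name `note_ranges` and its defaults are my own choices for illustration, not something from a library, so treat it as a starting point rather than a finished detector.

```python
import numpy as np
from scipy.signal import find_peaks


def note_ranges(envelope, distance=8000, prominence=300):
    """Return (start, end) index pairs, one per detected note.

    Assumption: a note spans from the trough before its loud peak
    to the trough after it.
    """
    envelope = np.asarray(envelope, dtype=float)
    peaks, _ = find_peaks(envelope, distance=distance, prominence=prominence)
    troughs, _ = find_peaks(-envelope, distance=distance, prominence=prominence)

    # Treat the first and last sample as boundaries too, so the first
    # and last notes also get a start and an end.
    bounds = np.unique(np.concatenate(([0], troughs, [len(envelope) - 1])))

    ranges = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        # Keep a segment only if it contains at least one loud peak.
        if np.any((peaks > start) & (peaks < end)):
            ranges.append((int(start), int(end)))
    return ranges
```

With the smoothed curve from the code above you would call `note_ranges(volumes)`. The `distance` and `prominence` values are the same guesses used for the plot, so notes that run into each other without getting much quieter may need a lower `prominence` or a shorter smoothing window.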