We're creating a little program that takes in a sound file, estimates the pitch and returns it as human-readable notation. See previously:
specification and motivation
implementation planning
dev log 0
So far we have:
# see license at the end of the post
import numpy as np
import librosa
import soundfile as sf

y, sr = librosa.load("recording_test.wav")

f0, voiced_flag, voiced_probs = librosa.pyin(y,
                                             sr=sr,
                                             fmin=librosa.note_to_hz('C0'),
                                             fmax=librosa.note_to_hz('C7'),
                                             fill_na=None)

# Compute the onset strength envelope, using a max filter of 5 frequency bins
# to cut down on false positives
onset_env = librosa.onset.onset_strength(y=y, sr=sr, max_size=5)

# Detect onset times from the strength envelope
onset_times = librosa.onset.onset_detect(onset_envelope=onset_env, sr=sr, units='time')

# Create timestamps to match against the onset_times
times = librosa.times_like(f0)

# Store the start and end indices of the notes
f0_indices_note_starts = -1*np.ones_like(onset_times[1:], int)
f0_indices_note_ends   = -1*np.ones_like(onset_times[1:], int)

for i in range(len(onset_times)-1):
    onset_start = onset_times[i]
    onset_end   = onset_times[i+1]
    for j in range(len(times)-1):
        is_start_found = f0_indices_note_starts[i] != -1
        is_end_found   = f0_indices_note_ends[i] != -1
        if is_start_found and is_end_found:
            break
        if onset_start <= times[j+1] and times[j] < onset_start:
            f0_indices_note_starts[i] = j+1
        if onset_end <= times[j+1] and times[j] < onset_end:
            f0_indices_note_ends[i] = j+1

assert not -1 in f0_indices_note_starts, f"Start index detection issue, {f0_indices_note_starts}"
assert not -1 in f0_indices_note_ends, f"End index detection issue, {f0_indices_note_ends}"
assert all(0 < (f0_indices_note_ends - f0_indices_note_starts)), \
    f"Start indices larger than end indices: start indices {f0_indices_note_starts} end indices {f0_indices_note_ends}"

# Extract the frequency ranges and convert to legible notes
notes_as_str = []
for s, e in zip(f0_indices_note_starts, f0_indices_note_ends):
    valid_frequencies = f0[s:e+1][voiced_flag[s:e+1]]
    sequence_as_str = librosa.hz_to_note(valid_frequencies)
    values, counts = np.unique(sequence_as_str, return_counts=True)
    most_frequent = np.argmax(counts)
    notes_as_str.append(values[most_frequent])

print(f"{len(notes_as_str)} notes detected:")
print(",".join(notes_as_str))
If the program gets a clean, machine-produced sample, all is good. If a voice sample gets fed through, however, the voiced_flag array is all false: the pyin algorithm struggles to see the signal as voiced. If we just take all frequencies instead, notes followed by lots of silence before the next note are badly classified, as the "frequencies of the silence" contribute to the final value.
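One thing worth noting before anything fancier: pyin also returns voiced_probs, so we could build our own voiced mask with a hand-picked threshold instead of trusting voiced_flag. A minimal, untested sketch, reusing the variables from the script above (the threshold value is arbitrary):

# Hypothetical workaround: derive the voiced mask from voiced_probs with a
# manual threshold instead of relying on voiced_flag (threshold is arbitrary)
VOICED_PROB_THRESHOLD = 0.3
custom_voiced_flag = voiced_probs > VOICED_PROB_THRESHOLD
print(f"voiced frames: flag={voiced_flag.sum()}, custom={custom_voiced_flag.sum()}")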
We could focus on detecting note endings. One fun way to do that would be to reverse the input sample, pass it through the onset detection algorithm, and see if it finds the ends of the notes well enough. It's a bit ridiculous; I like it. Come to think of it, if there is onset detection in librosa, is there something for note ends as well? No, and reading the onset detection documentation, it would just detect peaks in the envelope like before and bring us nowhere. So that's a no-go.
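Before moving on, here is roughly what the reversed-sample experiment would look like, in case I come back to it (untested sketch; the reversed onset times are mapped back by subtracting them from the clip duration):

# Reverse the signal so that note endings look like onsets to the detector
y_rev = y[::-1]
rev_onset_times = librosa.onset.onset_detect(y=y_rev, sr=sr, units='time')

# Map the reversed times back onto the original timeline
duration = librosa.get_duration(y=y, sr=sr)
note_end_candidates = np.sort(duration - rev_onset_times)
print(note_end_candidates)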
We could detect the silences, which should be minima in amplitude. If I understand correctly, by always starting the recordings with a bit of silence, the beginning of y gives me an estimate of the background noise. If I null every entry of y below that amplitude, maybe I can get somewhere. The question is, how long a stretch of silence? I have the sample rate, so I can compute, say, half a second's worth of samples, take the mean amplitude over that stretch, and null everything below it. Sounds doable, let's try it out.
Sample rate = number of samples / time, hence:
sample_range = 0.5 s × sample rate
(with librosa's default sample rate of 22050 Hz, that's 11025 samples).
# just after loading the sample
# silence suppression
ARBITRARY_SILENCE_PRIOR_RECORDING = 0.5  # s
background_samples = int(ARBITRARY_SILENCE_PRIOR_RECORDING*sr)
background_amplitude = np.mean(y[0:background_samples])
# null everything below the estimated background noise amplitude
y[y < background_amplitude] = 0

[...]

# Extract the frequency ranges and convert to legible notes
notes_as_str = []
for s, e in zip(f0_indices_note_starts, f0_indices_note_ends):
    valid_frequencies = f0[s:e+1]  # Cancel the voiced_flag influence
    sequence_as_str = librosa.hz_to_note(valid_frequencies)
    values, counts = np.unique(sequence_as_str, return_counts=True)
    most_frequent = np.argmax(counts)
    notes_as_str.append(values[most_frequent])
Yields:
C2,C3,C1,C♯1,F1,C0
For a C2,C3,C1,C♯1,F1,E1 ground truth. Not there yet. But I'll keep it in, as it does not work horribly for the voiced samples. (Edit: bad idea, as we'll see.)
If we look at the misclassified note, we have the following.
# valid_frequencies variable for the C0
[41.20344461 41.20344461 41.20344461 41.20344461 41.20344461 41.20344461 41.20344461
 44.1607585  37.13462352 31.2263718  26.25814411 22.08037925 18.56731176 16.35159783
 16.35159783 16.35159783 16.35159783 16.35159783 16.35159783 16.35159783 16.35159783
 16.35159783 16.35159783 16.35159783 16.35159783 16.35159783 16.35159783 16.35159783
 16.35159783 16.35159783 16.35159783 16.35159783 16.35159783 16.35159783 16.35159783
 16.35159783 16.35159783 16.35159783 16.35159783 16.35159783 16.35159783 18.78305353
 22.3369409  26.71712838 31.95625364 38.22275105 45.71808429 54.68322331 65.40639133]
It starts solidly, then we get trailing nonsense. I could boost the noise suppression: provided there is no click of a button in the leading silence, maybe it will erase that part to nothingness. I set the threshold to the max instead of the average. Nothing changes, which is weird. Does the noise suppression even work? No: inspecting the variables shows that y contains the same values before and after the noise suppression. Huh. In the clean, machine-produced .wav, the sound is indeed exactly 0 for half a second. And for the voice? Same? No, the average is a slightly negative value, because the samples are signed and oscillate around zero... My simplistic attempt is not good enough; one would have to cut out an entire amplitude band around zero. Attempt A was misguided.
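For the record, cutting a whole band around zero would look something like this (a sketch I did not pursue; it uses the magnitude of the leading silence rather than its signed mean):

# Estimate the noise floor from the magnitude of the leading silence,
# then null every sample whose magnitude falls below it (untested sketch)
background_samples = int(ARBITRARY_SILENCE_PRIOR_RECORDING * sr)
noise_floor = np.max(np.abs(y[:background_samples]))
y_denoised = np.where(np.abs(y) < noise_floor, 0.0, y)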
Librosa must have tools for this. Lucky for us, while digging through the docs I found librosa.effects.trim, made precisely for this ("Trim leading and trailing silence from an audio signal," says the doc). Let's try it out. It's as simple as:
y_trimmed, _ = librosa.effects.trim(y)
It uses the peak of the original signal as reference. Replacing y with the trimmed version in the script is trivial, but it still leads to the same result. One reread later: we can pass a threshold in decibels to decide what counts as silence. Let's try it out. Even when I put 80 dB it does not appear to do a thing. I must be missing something. Let's put an absurdly high value. Ah, rereading the doc indicates that the threshold is in decibels below the reference peak, so it's relative... Still does not change any value.
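For the record, the knob in question is top_db, and an aggressive call looks roughly like this (the value 30 is an arbitrary choice on my part):

# top_db: anything more than top_db below the reference peak counts as silence;
# lower values trim more aggressively
y_trimmed, trim_interval = librosa.effects.trim(y, top_db=30)
print(f"kept samples {trim_interval[0]}..{trim_interval[1]} of {len(y)}")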
I checked the machine-generated .wav used for the tests above and noticed that the waveform of the last note looked visually distinct. Indeed, in retrospect, the note is low enough to clip a bit, which is not really representative. So let's record a new sample in a better range:
C5,D5,D#6,A#5,A5,C6
Feeding it to the program yields:
5 notes detected:
C3,D3,D♯4,A♯3,A3
That's... not the same. At least the note names are right, but the octaves are off and an entire note is missing. Making the fmin/fmax of pyin more generous doesn't change things. Alright, is it an issue with the m8, which I am using to generate the whole thing? Trying a voiced sample instead, the onset detection is wonky and the noise suppression does not appear to be very effective.
Alright, let's use another source: alda. The score is: A4 B#4 C4 D4 F#4, with a rest between each note.
What does it yield?
A4,C5,C4,D4
Disappointing. The amplitude in this sample is a bit weak, though. If I boost it, is it better?
Same exact result.
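For reference, the boost itself is nothing sophisticated; numerically, a simple peak normalization would be one way to do it (sketch):

# Peak-normalize the signal so its loudest sample reaches ±1.0
y_boosted = librosa.util.normalize(y)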
The program is wonky. Both the note detection and classification have many errors.
When I check the onset detections by ear with clicks (see the sketch below), the starts are well placed but the tail of each note is often weird. N-1 notes are detected too often, meaning there is probably an issue with the loop that does the note segmenting. The detected frequencies at the beginning of the notes seem accurate enough. So it's really about cutting those notes better, it appears: even if the silence is damped down to 0, those frames still get included in the note's frequency range.
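"Checking the onsets with clicks" just means rendering a click at each detected onset time and listening to it on top of the recording, roughly like this (sketch; the output filename is arbitrary):

# Overlay a click track on the recording to audit the detected onsets by ear
clicks = librosa.clicks(times=onset_times, sr=sr, length=len(y))
sf.write("onsets_check.wav", y + clicks, sr)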
First, let's fix that N-1 note detection. Looking at the code, the culprit jumps right out:
# Store the start and end indices of the notes
f0_indices_note_starts = -1*np.ones_like(onset_times[1:], int)
f0_indices_note_ends   = -1*np.ones_like(onset_times[1:], int)

for i in range(len(onset_times)-1):
    onset_start = onset_times[i]
    onset_end   = onset_times[i+1]
    for j in range(len(times)-1):
        is_start_found = f0_indices_note_starts[i] != -1
        is_end_found   = f0_indices_note_ends[i] != -1
        if is_start_found and is_end_found:
            break
        if onset_start <= times[j+1] and times[j] < onset_start:
            f0_indices_note_starts[i] = j+1
        if onset_end <= times[j+1] and times[j] < onset_end:
            f0_indices_note_ends[i] = j+1
I assumed every note end would be marked by an onset of its own... In reality, there are as many notes as there are onsets.
f0_indices_note_starts = -1*np.ones_like(onset_times, int)
f0_indices_note_ends   = -1*np.ones_like(onset_times, int)

[...]

# handle the last onset: its start is the first f0 frame at or after the
# onset time, and its end is simply the last f0 frame of the recording
f0_indices_note_starts[-1] = np.searchsorted(times, onset_times[-1])
f0_indices_note_ends[-1]   = len(times) - 1
Now we get as many classifications as we should. They are still wrong of course. But we will fix that next.
I tried several other ways to detect silences, but with unreliable results so far. So let's do something much simpler: we'll only consider the first third of the sequence between onsets.
Will it cover all use cases? No. Will it cover my specific use case, where notes are played separately, padded with silences, and at roughly the same duration? Let's see.
# Extract the frequency ranges and convert to legible notes
notes_as_str = []
ARBITRARY_SELECTOR = 0.3
for s, e in zip(f0_indices_note_starts, f0_indices_note_ends):
    print(f"s{s} e{e}")
    valid_frequencies = f0[s:e+1]
    selection_boundary = int(np.floor(len(valid_frequencies)*ARBITRARY_SELECTOR))
    sequence_as_str = librosa.hz_to_note(valid_frequencies[1:selection_boundary])
    values, counts = np.unique(sequence_as_str, return_counts=True)
    most_frequent = np.argmax(counts)
    notes_as_str.append(values[most_frequent])
And that's a nope.
Some onset detections apparently yield time ranges that are too short.
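A quick way to see it is to print the inter-onset gaps and the number of f0 frames each note ends up with (debugging sketch, reusing the variables from the script):

# Inter-onset gaps in seconds, and how many f0 frames each note spans
print("onset gaps:", np.diff(onset_times))
print("frames per note:", f0_indices_note_ends - f0_indices_note_starts + 1)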
Sadly, not much progress today. That's life sometimes. The next thing I want to try is to change the data representation and see whether the problem is easier to solve on a spectrogram. We'll see.
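If I go down that road, the starting point would probably be a dB-scaled magnitude spectrogram, something along these lines (sketch for next time):

# Magnitude spectrogram on a decibel scale, as a possible alternative
# representation to the frame-wise f0 estimates
S = np.abs(librosa.stft(y))
S_db = librosa.amplitude_to_db(S, ref=np.max)
print(S_db.shape)  # (frequency bins, frames)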
librosa onset detection documentation
librosa.effects.trim
Given that I have technically made a derivative of existing software by remixing a librosa example file, here is the applicable license. Many thanks to Brian McFee as well as all the other contributors for making my life easier; I appreciate it.
Copyright (c) 2013--2023, librosa development team.

Permission to use, copy, modify, and/or distribute this software for any purpose with or without fee is hereby granted, provided that the above copyright notice and this permission notice appear in all copies.

THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.