|    .  ___  __   __   ___ .  __      __   __        __   __   __      
|    | |__  |__) |__) |__  ' /__`    /__` /  \  /\  |__) |__) /  \ \_/ 
|___ | |___ |  \ |  \ |___   .__/    .__/ \__/ /~~\ |    |__) \__/ / \ 
⋅∽⋅∼⋅∽⋅∼⋅∽⋅∼⋅∽⋅∼⋅∽⋅∼⋅∽⋅∼⋅∽⋅∼⋅∽⋅∼⋅∽⋅∼⋅∽⋅∼⋅∽⋅∼⋅∽⋅∼⋅∽⋅∼⋅∽⋅∼⋅∽⋅∽⋅∼⋅∽⋅∽⋅∽⋅∽⋅

Nachtigall: dev log 1

Introduction

We're creating a little program that takes in a sound file, estimates the pitch and returns it as human-readable notation. See previously:

specification and motivation
implementation planning
dev log 0

So far we have:

# see license at the end of the post
import numpy as np
import librosa
import soundfile as sf

y, sr = librosa.load("recording_test.wav")
f0, voiced_flag, voiced_probs = librosa.pyin(y,
                                             sr=sr,
                                             fmin=librosa.note_to_hz('C0'),
                                             fmax=librosa.note_to_hz('C7'),
                                             fill_na=None)

# Compute the onset strength envelope, using a max filter of 5 frequency bins
# to cut down on false positives
onset_env = librosa.onset.onset_strength(y=y, sr=sr, max_size=5)

# Detect onset times from the strength envelope
onset_times = librosa.onset.onset_detect(onset_envelope=onset_env, sr=sr, units='time')

# Create timestamps to match against the onset_times
times = librosa.times_like(f0, sr=sr)

# Store the start and end indices of the notes
f0_indices_note_starts=-1*np.ones_like(onset_times[1:],int)
f0_indices_note_ends  =-1*np.ones_like(onset_times[1:],int)

for i in range(len(onset_times)-1):
    onset_start=onset_times[i]
    onset_end  =onset_times[i+1]

    for j in range(len(times)-1):
        is_start_found = f0_indices_note_starts[i] != -1
        is_end_found = f0_indices_note_ends[i] != -1

        if is_start_found and is_end_found:
            break
        if onset_start<=times[j+1] and times[j]<onset_start:
            f0_indices_note_starts[i] = j+1
        if onset_end<=times[j+1] and times[j]<onset_end:
            f0_indices_note_ends[i]   = j+1

assert -1 not in f0_indices_note_starts, f"Start index detection issue, {f0_indices_note_starts}"
assert -1 not in f0_indices_note_ends, f"End index detection issue, {f0_indices_note_ends}"
assert all(0<(f0_indices_note_ends-f0_indices_note_starts)), f"Start indices larger than end indices: start indices {f0_indices_note_starts} end indices {f0_indices_note_ends}"


# Extract the frequency ranges and convert to legible notes
notes_as_str=[]
for s,e in zip(f0_indices_note_starts,f0_indices_note_ends):
    valid_frequencies=f0[s:e+1][voiced_flag[s:e+1]]
    sequence_as_str = librosa.hz_to_note(valid_frequencies)
    values, counts = np.unique(sequence_as_str, return_counts=True)
    most_frequent = np.argmax(counts)
    notes_as_str.append(values[most_frequent])

print(f"{len(notes_as_str)} notes detected:")
print(",".join(notes_as_str))

The problem

If the program gets a clean, machine-produced sample, all is good. If a voice sample gets fed in, though, the voiced_flag array is all False: the pyin algorithm struggles to see it as voiced. If we just take all frequencies instead, notes followed by a long silence before the next note are badly classified, as the "frequencies of the silence" contribute to the final value.

What can we do

We could focus on detecting note endings. One fun way to do that would be to reverse the input sample, then pass it through the onset detection algorithm and see if it detects the end of the note well enough. It's a bit ridiculous, I like it. Come to think of it, if there is onset detection in librosa, is there something for the note ends as well? No, and reading the onset detection documentation, it will detect the peaks in the envelope like before and bring us nowhere. So that's a no-go.

We could detect the silences, which should be minima in amplitude. If I understand correctly, if I were to always start the recordings with a bit of silence, I would have in y an estimate of the background noise. If I null every entry of y below that amplitude, maybe I can get somewhere. The question is, how long should that silence be? I have the sample rate, so I can compute how many samples make up, say, half a second. Then take the mean amplitude of that silence and null everything below it. Sounds doable, let's try it out.

Attempt A, background noise suppression

Sample rate = number of samples / time, hence:

sample_range = 0.5 s × sample rate.

# just after loading the sample
# silence suppression
ARBITRARY_SILENCE_PRIOR_RECORDING = 0.5 #s
background_samples = int(ARBITRARY_SILENCE_PRIOR_RECORDING*sr)

background_amplitude = np.mean(y[0:background_samples])

# null everything below the estimated background noise amplitude
y[y<background_amplitude] = 0

[...]

# Extract the frequency ranges and convert to legible notes
notes_as_str=[]
for s,e in zip(f0_indices_note_starts,f0_indices_note_ends):
    valid_frequencies=f0[s:e+1] # Cancel the voiced_flag influence
    sequence_as_str = librosa.hz_to_note(valid_frequencies)
    values, counts = np.unique(sequence_as_str, return_counts=True)
    most_frequent = np.argmax(counts)
    notes_as_str.append(values[most_frequent])

Yields:

C2,C3,C1,C♯1,F1,C0

For a C2,C3,C1,C♯1,F1,E1 ground truth. Not there yet. But I'll keep it in, as it does not work horribly for the voiced samples. (Edit: bad idea, as we'll see.)

Attempt B, potentially better background noise suppression

If we look at the misclassified note, we have the following.

# valid_frequencies variable for the C0
[41.20344461 41.20344461 41.20344461 41.20344461 41.20344461 41.20344461
 41.20344461 44.1607585  37.13462352 31.2263718  26.25814411 22.08037925
 18.56731176 16.35159783 16.35159783 16.35159783 16.35159783 16.35159783
 16.35159783 16.35159783 16.35159783 16.35159783 16.35159783 16.35159783
 16.35159783 16.35159783 16.35159783 16.35159783 16.35159783 16.35159783
 16.35159783 16.35159783 16.35159783 16.35159783 16.35159783 16.35159783
 16.35159783 16.35159783 16.35159783 16.35159783 16.35159783 18.78305353
 22.3369409  26.71712838 31.95625364 38.22275105 45.71808429 54.68322331
 65.40639133]

It starts solidly, then we get trailing nonsense. I could boost the noise suppression: provided there is no button click in the recording, maybe it will erase that part entirely. I set the threshold to the max instead of the average. Nothing changes, which is weird. Does the noise suppression even work? No: inspecting the variables shows that y contains the same values before and after the noise suppression. Huh. In the clean, machine-produced .wav, the signal is indeed 0 for the first half second. And for the voice? Same? The average is a negative value, because the samples are signed amplitudes that oscillate around zero... My simplistic attempt is not good enough; one would have to cut an entire amplitude band. Attempt A was misguided.
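To make the problem with the signed mean concrete, here is a minimal sketch on a synthetic signal, using numpy only. The helper gate_by_absolute_amplitude is hypothetical (it is not part of librosa), and it assumes that the first half second of the recording contains nothing but background noise. It takes the peak magnitude of that intro as the gate level, which is only one possible reading of "cutting an entire amplitude band" and has not been tried on the real recordings.

```python
import numpy as np

def gate_by_absolute_amplitude(y, sr, silence_seconds=0.5):
    """Zero every sample whose magnitude does not exceed the background peak.

    Assumes the first `silence_seconds` of `y` contain only background noise.
    """
    n_background = int(silence_seconds * sr)
    # Peak magnitude of the assumed-silent intro; the signed mean would be ~0.
    threshold = np.max(np.abs(y[:n_background]))
    gated = y.copy()
    gated[np.abs(gated) <= threshold] = 0.0
    return gated, threshold

# Synthetic check: 0.5 s of faint noise followed by 0.5 s of a 440 Hz tone.
sr = 22050
rng = np.random.default_rng(0)
noise = 0.01 * rng.uniform(-1.0, 1.0, sr // 2)
t = np.arange(sr // 2) / sr
tone = 0.5 * np.sin(2 * np.pi * 440.0 * t)
y = np.concatenate([noise, tone])

signed_mean = np.mean(y[: sr // 2])  # close to zero, sign is arbitrary
gated, threshold = gate_by_absolute_amplitude(y, sr)
```

A per-sample gate like this also zeroes the tone wherever it passes near zero, so it distorts the notes themselves. A frame-wise energy measure is probably a better fit than a per-sample threshold.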

Librosa must have tools for this. Ah, lucky for us, while investigating the doc, I found librosa.effects.trim, which looks made for removing silence ("Trim leading and trailing silence from an audio signal." says the doc). Let's try it out. It's as simple as:

y_trimmed, _ = librosa.effects.trim(y)

By default it uses the peak of the original signal as reference. Replacing y with the trimmed version in the script is trivial, but it still leads to the same result. One reread later: we can pass a threshold in decibels (top_db) below which the signal counts as silence. Let's try it out. Even when I put 80 dB it does not appear to do a thing. I must be missing something. Let's put an absurdly high value. Ah, rereading the doc indicates that the threshold is in decibels below the reference, so it's relative... Still does not change any value.
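One thing worth keeping in mind: as its docstring says, trim only removes leading and trailing silence, so the gaps between notes stay in the signal whatever threshold is used. The sketch below shows the same "decibels below the loudest part" idea applied to the whole signal, using numpy only. The helper non_silent_intervals, the 40 dB threshold and the frame sizes are assumptions made for illustration, not librosa's implementation.

```python
import numpy as np

def non_silent_intervals(y, frame_length=2048, hop_length=512, top_db=40.0):
    """Return (start_sample, end_sample) pairs of regions louder than
    `top_db` decibels below the loudest frame."""
    n_frames = 1 + max(0, len(y) - frame_length) // hop_length
    rms = np.array([
        np.sqrt(np.mean(y[i * hop_length : i * hop_length + frame_length] ** 2))
        for i in range(n_frames)
    ])
    # Decibels relative to the loudest frame; the floor avoids log10(0).
    db = 20.0 * np.log10(np.maximum(rms, 1e-10) / max(rms.max(), 1e-10))
    loud = db > -top_db

    # Pad with False so every run of loud frames has a rising and a falling edge.
    padded = np.concatenate([[False], loud, [False]]).astype(int)
    edges = np.diff(padded)
    starts = np.flatnonzero(edges == 1)
    ends = np.flatnonzero(edges == -1)  # exclusive frame index
    return [
        (int(s * hop_length), int(min(len(y), (e - 1) * hop_length + frame_length)))
        for s, e in zip(starts, ends)
    ]

# Synthetic check: tone, half a second of digital silence, tone.
sr = 22050
n_tone, n_gap = sr // 4, sr // 2
tone = 0.5 * np.sin(2 * np.pi * 440.0 * np.arange(n_tone) / sr)
y = np.concatenate([tone, np.zeros(n_gap), tone])
intervals = non_silent_intervals(y)
```

On this synthetic signal the two tones come out as two separate intervals. Real recordings have a noise floor rather than exact zeros, so the threshold would need tuning per recording. librosa.effects.split appears to cover similar ground and may be worth checking in the documentation.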

Attempt C, verifying that the ground truth is alright

I checked the machine-generated .wav that I use for the tests above and noticed that the profile of the last note was visually distinct. Indeed, in retrospect, the sound is low enough to clip a bit. That is not really representative. So let's record a new sample in a better range:

C5,D5,D#6,A#5,A5,C6

Feeding it to the program yields:

5 notes detected:

C3,D3,D♯4,A♯3,A3

That's... not the same. At least the note names are right, but the octaves are off and an entire note is missing. Making the fmin/fmax of pyin more generous doesn't change things. Alright, is it an issue with the m8, with which I am generating the whole thing? Trying it out with a voiced sample, the onset detection is wonky and the noise suppression does not appear to be very effective.

Alright, let's use another source: alda. The sheet music is: A4 B#4 C4 D4 F#4, with rests between the notes.

What does it yield?

A4,C5,C4,D4

Disappointing. The amplitude is a bit weak though in the sample. If I boost it, is it better?

Same exact result.

Reassessing the situation

The program is wonky. Both the note detection and classification have many errors.

When I listen to the onset detections with clicks, the starts are well placed, but the tails are often weird. N-1 notes are detected too often, meaning there is probably an issue with the loop that does the note segmenting. The detected frequencies at the beginning of the notes seem accurate enough. So it's really about cutting those notes better, it appears. Even if I damp the tail down to 0, it still gets included.

First let's fix that N-1 note detection. Looking at the code, it screams at my eyes:

# Store the start and end indices of the notes
f0_indices_note_starts=-1*np.ones_like(onset_times[1:],int)
f0_indices_note_ends  =-1*np.ones_like(onset_times[1:],int)

for i in range(len(onset_times)-1):
    onset_start=onset_times[i]
    onset_end  =onset_times[i+1]

    for j in range(len(times)-1):
        is_start_found = f0_indices_note_starts[i] != -1
        is_end_found = f0_indices_note_ends[i] != -1

        if is_start_found and is_end_found:
            break
        if onset_start<=times[j+1] and times[j]<onset_start:
            f0_indices_note_starts[i] = j+1
        if onset_end<=times[j+1] and times[j]<onset_end:
            f0_indices_note_ends[i]   = j+1

I assumed the note end would have an onset too... In reality, there are as many notes as onsets.

f0_indices_note_starts=-1*np.ones_like(onset_times,int)
f0_indices_note_ends  =-1*np.ones_like(onset_times,int)

[...]

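# handle last onset: it runs from its own onset frame to the final frame
# (the arrays store frame indices, so convert the onset time to an index first)
f0_indices_note_starts[-1] = min(np.searchsorted(times, onset_times[-1]), len(times)-1)
f0_indices_note_ends[-1] = len(times)-1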

Now we get as many classifications as we should. They are still wrong of course. But we will fix that next.
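For comparison, the whole segmentation step can be sketched without the nested loop by using np.searchsorted, which returns the first frame whose timestamp is greater than or equal to each onset. This is an alternative under the same assumption as above (every note ends where the next onset begins, and the last note runs to the final frame). The helper segment_note_indices and the toy arrays are made up for the example.

```python
import numpy as np

def segment_note_indices(times, onset_times):
    """Map each onset to a (start, end) pair of frame indices into `times`.

    Note i runs from its own onset to the next onset; the last note runs
    to the final frame.
    """
    times = np.asarray(times)
    onset_times = np.asarray(onset_times)
    # First frame whose timestamp is >= the onset, as in the nested loop.
    starts = np.searchsorted(times, onset_times, side="left")
    # Guard against an onset that falls after the last frame timestamp.
    starts = np.minimum(starts, len(times) - 1)
    ends = np.append(starts[1:], len(times) - 1)
    return starts, ends

times = np.arange(10) * 0.1            # stand-in for librosa.times_like(f0)
onsets = np.array([0.15, 0.42, 0.71])  # stand-in for onset_times
starts, ends = segment_note_indices(times, onsets)
```

With three onsets this yields three (start, end) pairs, one per note, which is the count the fix above is after.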

Attempt D, throwing cleverness out of the window

I tried several other ways to detect silences, but with unreliable results so far. So let's do something much simpler: we'll only consider the first third of the sequence between onsets.

Will it cover all use cases? No. Will it cover my specific use case, where notes are spelled out one by one, separated by silences and roughly equal in length?

# Extract the frequency ranges and convert to legible notes
notes_as_str=[]

ARBITRARY_SELECTOR = 0.3
for s,e in zip(f0_indices_note_starts,f0_indices_note_ends):
    print(f"s{s} e{e}")
    valid_frequencies=f0[s:e+1]
    selection_boundary = int(np.floor(len(valid_frequencies)*ARBITRARY_SELECTOR))
    sequence_as_str = librosa.hz_to_note(valid_frequencies[1:selection_boundary])
    values, counts = np.unique(sequence_as_str, return_counts=True)
    most_frequent = np.argmax(counts)
    notes_as_str.append(values[most_frequent])

And that's a nope.

Some onset detections apparently yield time ranges that are too short.
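A range that is too short leaves nothing to vote on once only the first 30% is kept. A minimal sketch of a guard for that case follows, using numpy only. The helper dominant_midi_note and the min_frames value are assumptions for illustration. It votes on MIDI note numbers (standard equal temperament, A4 = 440 Hz = MIDI 69) instead of calling librosa.hz_to_note, so that it runs without librosa.

```python
import numpy as np

def dominant_midi_note(frequencies, selector=0.3, min_frames=3):
    """Most frequent MIDI note number in the first `selector` share of a note.

    Returns None when fewer than `min_frames` usable frames remain.
    """
    frequencies = np.asarray(frequencies, dtype=float)
    boundary = int(np.floor(len(frequencies) * selector))
    # Skip the very first frame, as the listing above does.
    head = frequencies[1:boundary]
    # Drop NaN and non-positive values, which have no defined pitch.
    head = head[np.isfinite(head) & (head > 0)]
    if len(head) < min_frames:
        return None
    # Equal-temperament mapping from Hz to the nearest MIDI note number.
    midi = np.rint(69 + 12 * np.log2(head / 440.0)).astype(int)
    values, counts = np.unique(midi, return_counts=True)
    return int(values[np.argmax(counts)])

long_note = [440.0] * 10 + [16.35] * 20   # solid start, trailing nonsense
short_note = [440.0, 440.0, 440.0]        # too short to keep a third of it

long_result = dominant_midi_note(long_note)
short_result = dominant_midi_note(short_note)
```

Returning None lets the caller decide what to do with a too-short segment, for example merging it with its neighbour or skipping it. The resulting number could then be turned back into a note name, for instance with librosa.midi_to_note.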

Summary

Sadly, not much progress today. That's life sometimes. The next thing I want to try is to change the data representation in order to see if the problem is easier to solve using a spectrogram. We'll see.

References

librosa onset detection documentation
librosa.effects.trim

next dev log

Licence

Given that I technically have done a derivative of the existing software by remixing a librosa example file, here is the applicable license. Many thanks to Brian McFee as well as all the other contributors for making my life easier, I appreciate it.

ISC License

Copyright (c) 2013--2023, librosa development team.

Permission to use, copy, modify, and/or distribute this software for any purpose with or without fee is hereby granted, provided that the above copyright notice and this permission notice appear in all copies.

THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
⋅∽⋅∼⋅∽⋅∼⋅∽⋅∼⋅∽⋅∼⋅∽⋅∼⋅∽⋅∼⋅∽⋅∼⋅∽⋅∼⋅∽⋅∼⋅∽⋅∼⋅∽⋅∼⋅∽⋅∼⋅∽⋅∼⋅∽⋅∼⋅∽⋅∽⋅∼⋅∽⋅∽⋅
