|    .  ___  __   __   ___ .  __      __   __        __   __   __      
|    | |__  |__) |__) |__  ' /__`    /__` /  \  /\  |__) |__) /  \ \_/ 
|___ | |___ |  \ |  \ |___   .__/    .__/ \__/ /~~\ |    |__) \__/ / \ 
⋅∽⋅∼⋅∽⋅∼⋅∽⋅∼⋅∽⋅∼⋅∽⋅∼⋅∽⋅∼⋅∽⋅∼⋅∽⋅∼⋅∽⋅∼⋅∽⋅∼⋅∽⋅∼⋅∽⋅∼⋅∽⋅∼⋅∽⋅∼⋅∽⋅∽⋅∼⋅∽⋅∽⋅∽⋅∽⋅

Nachtigall: implementation plan

I want to create a little program, dubbed "Nachtigall". Its specification is as follows:

Now that we have a specification, let's design an implementation.

We need to perform pitch detection. As my earlier quick research suggested, a good way to do that is the pYin pitch estimator. Fortunately for us, there are publicly available implementations of it. For instance librosa, a "python package for music and audio analysis", linked below, has a version of it. I also considered audioFlux, a "deep learning tool library for audio and music analysis", which has a nice pYin-based pitch detection example as well. But long story short, audioFlux's codebase has about 5 contributors while librosa's has about 100 and shows more recent signs of life. That takes care of the muscle of the whole project.

Now, the librosa pYin example that is relevant for us looks like this:

import librosa

# Load the audio file; librosa returns the signal and its sample rate.
y, sample_rate = librosa.load(soundfile)

# Run the pYin estimator, restricting detection to the C2-C7 range.
f0, voiced_flag, voiced_probs = librosa.pyin(y,
                                             sr=sample_rate,
                                             fmin=librosa.note_to_hz('C2'),
                                             fmax=librosa.note_to_hz('C7'))

Which is pretty straightforward. Load your sound file, get back the signal and its sample rate, and feed both to the algorithm along with some boundary settings for the frequencies to be detected. We then have f0, the fundamental frequency of whatever we fed in. That's step one. Next we need to convert that fundamental frequency into something we can work with.

Fortunately, as per the doc:

f0: np.ndarray [shape=(…, n_frames)]

time series of fundamental frequencies in Hertz.

Which is neat, but we need to know which parts of the array correspond to separate notes. Therefore onset detection is necessary. Fortunately, the librosa examples have us covered. Huh. To the point that the rest is actually solved and I'll mostly be rambling over it.

First, they recommend suppressing part of the sound using the voiced regions that pyin detected for us. Then they pass it through a synthesiser, which apparently results in a cleaner signal.
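I won't reproduce the synthesiser step here, but the suppression part could look roughly like this. This is my own sketch rather than the librosa example verbatim, and it assumes pyin's default hop length of 512 samples per frame:

import numpy as np

# Expand the per-frame voiced flags into a per-sample mask
# (hop_length is pyin's default; adjust if you pass a different one).
hop_length = 512
mask = np.repeat(voiced_flag, hop_length)
mask = np.pad(mask, (0, max(0, len(y) - len(mask))))[:len(y)]

# Silence the samples that pyin considered unvoiced.
y_voiced = y * mask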

Then the onset envelope is computed and the corresponding onset times are deduced from it. The example then voices them as clicks, which is neat but diverges from our goal. Still, we're not far from it.
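In librosa terms that part should look more or less like this (the calls exist as written, though the example's exact parameters may differ; y_voiced is the masked signal from the sketch above):

# Onset envelope and onset times of the cleaned-up signal.
onset_env = librosa.onset.onset_strength(y=y_voiced, sr=sample_rate)
onset_times = librosa.onset.onset_detect(onset_envelope=onset_env,
                                         sr=sample_rate, units='time')

# The example then sonifies the onsets as a click track.
clicks = librosa.clicks(times=onset_times, sr=sample_rate, length=len(y))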

If we take the f0 array and cut it at the onset times, we should end up with smaller arrays, each containing the frequencies around a specific note. We could then take the median or the mean frequency of each slice and pass that to librosa.hz_to_note.
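A sketch of that slicing, assuming pyin and the onset detector both use the default hop length of 512 so the frame indices line up:

# Onsets as frame indices rather than seconds, so they index into f0.
onset_frames = librosa.onset.onset_detect(onset_envelope=onset_env,
                                          sr=sample_rate, units='frames')

notes = []
for segment in np.split(f0, onset_frames):
    voiced = segment[~np.isnan(segment)]   # pyin marks unvoiced frames as NaN
    if len(voiced) > 0:
        notes.append(librosa.hz_to_note(np.median(voiced)))

print(notes)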

This would give us a string array with the note labels, which we can print out, achieving our goal. Apparently, hz_to_note reports accidentals as sharps rather than flats, which is what we want. Sounds good.
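For instance (from memory, so treat the exact output as approximate; recent librosa versions return unicode accidentals by default, with an option for plain ASCII):

librosa.hz_to_note(466.16)                 # something like 'A♯4'
librosa.hz_to_note(466.16, unicode=False)  # 'A#4'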

So let's create a virtual Python environment to keep things clean, install librosa, implement the demo, and verify that it works. Then tweak it to print out the notes. Finally, let's handle the sound file input more elegantly than as a hard-coded variable, and we should be set.
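For that last point, reading the path from the command line should be enough. A minimal sketch, with nachtigall.py as a placeholder name:

import sys
import librosa

def main():
    if len(sys.argv) != 2:
        sys.exit("usage: python nachtigall.py <soundfile>")
    soundfile = sys.argv[1]
    y, sample_rate = librosa.load(soundfile)
    # ... pitch detection, onset slicing and note printing go here ...

if __name__ == "__main__":
    main()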

Resources

original post
pYin: A fundamental frequency estimator using probabilistic threshold distributions (M. Mauch and S. Dixon)
librosa's pyin pitch detection functionality
librosa audio_playback example with both pitch and onset detection
audioFlux pitch detection example

⋅∽⋅∼⋅∽⋅∼⋅∽⋅∼⋅∽⋅∼⋅∽⋅∼⋅∽⋅∼⋅∽⋅∼⋅∽⋅∼⋅∽⋅∼⋅∽⋅∼⋅∽⋅∼⋅∽⋅∼⋅∽⋅∼⋅∽⋅∼⋅∽⋅∽⋅∼⋅∽⋅∽⋅
