Pitch shifting

An adventure story about writing a pitch shifter, with incorrect but interesting ideas.

Pitch shifting
Naive approach
Mild obstacles
Simple fix
More obstacles
Fancier version, interesting failure
Perfect math
The horror
Boring, pragmatic method

Naive approach

Task (almost)

All frequency components \(f\) of a signal need to be moved to a different frequency \(f' = f \cdot K\) for a constant ratio \(K\).

As with the equalizers, this screams to be solved via FFT:

transform the whole signal into the frequency domain
remap any component \(z_i\) to the new array location \(z_j\) with \(j = i \cdot K\)
transform back into the time domain
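The three steps above can be sketched in a few lines. This is only an illustration: it uses a slow textbook DFT from the standard library as a stand-in for a real FFT, rounds the target index to the nearest integer, and the function names are made up for this post.

```python
import cmath
import math


def dft(x):
    """Naive O(N^2) discrete Fourier transform (stand-in for a real FFT)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(z):
    """Inverse of dft()."""
    n = len(z)
    return [sum(z[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def naive_pitch_shift(signal, ratio):
    """Move component z_i to index j = round(i * ratio); collisions are summed."""
    n = len(signal)
    z = dft(signal)
    shifted = [0j] * n
    for i in range(n // 2 + 1):                  # DC up to Nyquist
        j = round(i * ratio)
        if j <= n // 2:                          # drop bins pushed past Nyquist
            shifted[j] += z[i]
    for j in range(1, (n + 1) // 2):
        shifted[n - j] = shifted[j].conjugate()  # mirrored half keeps output real
    return [v.real for v in idft(shifted)]
```

For a sine that sits exactly on FFT frequency 4 of a 64-sample signal, `naive_pitch_shift(x, 1.5)` returns a clean sine on frequency 6.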

Mild obstacles

First, we notice that most new array locations \(i \cdot K\) don't fall on integer indices. When transforming a whole 5 min song, we can simply round up or down without any audible problems. Multiple components mapped to the same location can be summed up.

Things get complicated when the signal comes in small chunks (\(\le\) 1024 samples). The FFT frequencies of a short chunk are much further apart, so rounding suddenly becomes more severe and introduces audible detuning and harmonic changes.
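To put rough numbers on this, here is a small calculation. The 44.1 kHz sample rate, the two-semitone shift and the helper names are assumptions picked for illustration, not something fixed by the method.

```python
import math


def bin_spacing_hz(n_samples, sample_rate=44100.0):
    """Distance between neighbouring FFT frequencies."""
    return sample_rate / n_samples


def rounding_error_cents(source_bin, ratio):
    """Detuning caused by rounding the target index source_bin * ratio."""
    target = source_bin * ratio
    return 1200.0 * math.log2(round(target) / target)


whole_tone = 2 ** (2 / 12)                       # shift up by two semitones

# 1024-sample chunk: bin 10 sits at roughly 431 Hz
print(bin_spacing_hz(1024), rounding_error_cents(10, whole_tone))

# 5 min song: the same tone sits near bin 129200
print(bin_spacing_hz(300 * 44100), rounding_error_cents(129200, whole_tone))
```

Under these assumptions the short chunk lands about 35 cents flat, roughly a third of a semitone, while the error for the whole song stays well below a hundredth of a cent.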

Simple fix

If the new index \(i \cdot K\) falls between two integers \(j\) and \(j+1\), add the component to both array locations, weighted by how close each one is: weights \((\frac{1}{2}, \frac{1}{2})\) when \(i \cdot K\) lies exactly halfway, smoothly going to \((1,0)\) as it approaches the integer index \(j\).

This is still bad! Now, a single sine wave input will create dissonant output when spread over two close frequencies.
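The weighting can be written down as plain linear interpolation. A minimal sketch, assuming `spectrum` holds only the positive-frequency half of the FFT array; both function names are invented here.

```python
import math


def spread_weights(target):
    """Split a fractional index into (j, weight for j, weight for j + 1)."""
    j = math.floor(target)
    frac = target - j
    return j, 1.0 - frac, frac


def remap_linear(spectrum, ratio):
    """Distribute each component over the two bins around i * ratio."""
    out = [0j] * len(spectrum)
    for i, z in enumerate(spectrum):
        j, w_low, w_high = spread_weights(i * ratio)
        if j < len(out):
            out[j] += w_low * z
        if j + 1 < len(out):
            out[j + 1] += w_high * z
    return out
```

With `ratio = 1.5`, a single component at index 3 ends up as two half-sized components at indices 4 and 5, which is exactly the pair of close frequencies that sounds dissonant.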

More obstacles

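By the way, chunks won't fit together anymore: each chunk is modified on its own, so the waveform at the end of one chunk no longer continues smoothly into the start of the next.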

Fancier version, interesting failure

Actually, what is the perfect output of a sine wave input?

Perfect math

Let's say the input frequency is one of the FFT's frequencies. Then after the FFT, we get an array \((\dots,0,0,z,0,0,\dots)\) with a single nonzero entry.

Of course, the output should again be a sine wave, but one that doesn't hit an FFT frequency, so the FFT'ed output will be a very messy array. With some patience, we can compute the components of this array: they follow a sinc function with its peak at the location \(i \cdot K\), but with some decaying mess everywhere else.
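The "patience" part is a geometric series. As a sketch, write \(N\) for the chunk length, \(\nu = i \cdot K\) for the non-integer target position, \(\omega = e^{2\pi\sqrt{-1}/N}\), and treat the tone as a complex exponential \(x_t = \frac{z}{N}\,\omega^{\nu t}\) (these symbols are introduced here for the derivation only). Its FFT components \(z'_k\) are

\[
z'_k = \frac{z}{N} \sum_{t=0}^{N-1} \omega^{(\nu - k)t}
     = \frac{z}{N} \cdot \frac{1 - \omega^{(\nu - k)N}}{1 - \omega^{\nu - k}},
\qquad
|z'_k| = \frac{|z|}{N} \left| \frac{\sin(\pi(\nu - k))}{\sin(\pi(\nu - k)/N)} \right|
       \approx |z| \left| \frac{\sin(\pi(\nu - k))}{\pi(\nu - k)} \right| .
\]

Strictly speaking this is a periodic sinc; the approximation on the right holds for \(|\nu - k| \ll N\), which is where the sinc shape comes from.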

So, to create a pitch shifter, we need to take each component in the FFT array, use it to scale a sinc function centred at its new location, sum it all up (and inverse-FFT back). This could be done with a matrix. An approximation would just interpolate between more neighbours around the index \(i \cdot K\).
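A matrix version might look like the sketch below. It uses the exact (periodic) sinc of a finite chunk, treats the upper half of the FFT array as negative frequencies, and simply drops anything that would land beyond Nyquist; all of these are choices made for this illustration, and the names are made up.

```python
import cmath
import math


def kernel(d, n):
    """FFT component of a unit tone that sits d bins away (periodic sinc)."""
    r = d / n
    if abs(r - round(r)) < 1e-12:            # tone sits exactly on this bin
        return complex(n)
    phase = cmath.exp(1j * math.pi * d * (n - 1) / n)
    return phase * math.sin(math.pi * d) / math.sin(math.pi * d / n)


def shift_matrix(n, ratio):
    """m[k][i]: how much input component i contributes to output component k."""
    m = [[0j] * n for _ in range(n)]
    for i in range(n):
        f = i if i < n / 2 else i - n        # signed frequency of bin i
        nu = f * ratio                       # fractional target location
        if abs(nu) >= n / 2:
            continue                         # would alias past Nyquist: drop it
        for k in range(n):
            m[k][i] = kernel(nu - k, n) / n
    return m


def apply_matrix(m, z):
    """Plain matrix-vector product."""
    return [sum(row[i] * z[i] for i in range(len(z))) for row in m]
```

Applying `shift_matrix(16, 1.5)` to an array with a single entry at index 3 yields the messy array whose inverse FFT is a clean tone at position 4.5.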

But...

The horror

Yes, this perfectly solves the problem. All frequencies are shifted. But now, the song changes speed!

Why?

Because the speed of a song is also encoded in its frequencies. A steady drum beat, for example with one beat every 1 s, will show up as a 1 Hz component in the spectrum. "Perfect" pitch shifting would also shift this frequency, and therefore change the speed of the drums (same for melodies).

Actually, our fancy method is just stretching the signal (interpolation).
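This is easy to check numerically. The sketch below re-synthesises every component at its exact new frequency \(i \cdot K\), without any rounding; it uses a slow textbook DFT and a made-up function name, and ignores aliasing for brevity.

```python
import cmath
import math


def dft(x):
    """Naive O(N^2) discrete Fourier transform (stand-in for a real FFT)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def perfect_shift(x, ratio):
    """Play back every component z_i at frequency i * ratio."""
    n = len(x)
    z = dft(x)
    out = []
    for t in range(n):
        total = 0j
        for i in range(n):
            f = i if i < n / 2 else i - n      # signed frequency of bin i
            total += z[i] * cmath.exp(2j * math.pi * f * ratio * t / n)
        out.append((total / n).real)
    return out
```

With `ratio = 2`, output sample `t` is simply input sample `2 * t` (wrapping around at the end, since the FFT treats the signal as periodic): one octave up, and twice as fast.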

Wrong task definition? - Psychoacoustics!

Our ears are similar to an FFT. But they don't FFT a 5 min song to tell us afterwards which frequencies were present. Rather, our ears perform a "windowed" FFT, i.e. we analyze short chunks of audio at a time, with a finite time resolution. So, we need to distinguish between "audible" frequencies above 20 Hz inside chunks, which define pitch, and lower "speed" frequencies that describe patterns and changes between chunks.

Boring, pragmatic method

Define a window/chunk size (let's say 1024 samples) that corresponds to the ear's time resolution.
Split the signal into chunks, but with large overlap (50%).
Perform the naive FFT-based pitch shift on each chunk individually.
Glue the shifted chunks together smoothly.

This solves the problem of speed changes/time resolution, but also the problem of non-matching chunks.
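A compact sketch of the four steps follows. The post does not say how to glue the chunks, so the Hann-shaped cross-fade used here is an assumption (its two overlapping halves sum to one at 50% overlap). The small recursive FFT only handles power-of-two sizes, and the function names are made up.

```python
import cmath
import math


def fft(x, sign=-1):
    """Recursive radix-2 FFT; len(x) must be a power of two.
    sign=+1 gives the unnormalised inverse transform."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2], sign), fft(x[1::2], sign)
    tw = [cmath.exp(sign * 2j * math.pi * k / n) * odd[k] for k in range(n // 2)]
    return ([even[k] + tw[k] for k in range(n // 2)] +
            [even[k] - tw[k] for k in range(n // 2)])


def shift_chunk(chunk, ratio):
    """Naive shift of one chunk: component i goes to index round(i * ratio)."""
    n = len(chunk)
    z = fft(chunk)
    out = [0j] * n
    for i in range(n // 2 + 1):                # DC up to Nyquist
        j = round(i * ratio)
        if j <= n // 2:                        # drop bins pushed past Nyquist
            out[j] += z[i]
    for j in range(1, n // 2):
        out[n - j] = out[j].conjugate()        # mirrored half keeps result real
    return [v.real / n for v in fft(out, sign=+1)]


def pitch_shift(signal, ratio, size=1024):
    """Shift overlapping chunks (50% overlap) and cross-fade them together."""
    hop = size // 2
    window = [0.5 - 0.5 * math.cos(2 * math.pi * t / size) for t in range(size)]
    result = [0.0] * len(signal)
    for start in range(0, len(signal) - size + 1, hop):
        shifted = shift_chunk(signal[start:start + size], ratio)
        for t in range(size):
            result[start + t] += window[t] * shifted[t]
    return result
```

The first and last half chunk are covered by only one window, so they fade in and out; everything in between is a full cross-fade of two neighbouring chunks. The overall length stays the same, so the speed of the song is untouched.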

\(-_-)/
