Pitch shifting
An adventure story of writing a pitch shifter, with incorrect but interesting ideas.
Naive approach
- Task (almost)
As with the equalizers, this screams to be solved via FFT:
- transform the whole signal into the frequency domain
- remap each component \(z_i\) to the new array location \(z_j\) with \(j = i \cdot K\), where \(K\) is the pitch-shift factor
- transform back into the time domain
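The three steps above can be sketched in a few lines. This is only a minimal sketch, assuming NumPy, a real-valued signal, and rounding to the nearest index; the function name `naive_pitch_shift` is made up for illustration.

```python
import numpy as np

def naive_pitch_shift(x, K):
    """Move every FFT component i to the rounded index i * K (whole signal at once)."""
    X = np.fft.rfft(x)                      # step 1: into the frequency domain
    i = np.arange(len(X))
    j = np.rint(i * K).astype(int)          # step 2: new (rounded) array locations
    keep = j < len(X)                       # components pushed past Nyquist are dropped
    Y = np.zeros_like(X)
    np.add.at(Y, j[keep], X[keep])          # components landing on the same index are summed
    return np.fft.irfft(Y, n=len(x))        # step 3: back into the time domain

# Usage: one second of a 440 Hz tone at an assumed 8000 Hz sample rate, shifted by K = 1.5
t = np.arange(8000) / 8000.0
shifted = naive_pitch_shift(np.sin(2 * np.pi * 440 * t), 1.5)
```

With one second of signal the frequency bins are 1 Hz apart, so the 440 Hz tone lands exactly on the 660 Hz bin and no rounding is needed.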
Mild obstacles
First, we notice that most new array locations \(i \cdot K\) don't fall on integer indices. When transforming a whole 5 min song at once, the frequency bins are so closely spaced that we can round up or down without any audible problems. Multiple components mapped to the same location can simply be summed up.
Things get complicated when the signal comes in small chunks (\(\le\) 1024 samples). The bins are now far apart, so rounding is much more severe and introduces audible detuning and harmonic changes.
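To get a feeling for how severe the rounding is, here is a small calculation of the detuning in cents. The 44100 Hz sample rate, the 430 Hz test tone, and the one-semitone shift are assumptions picked for illustration, not values from the text above.

```python
import math

SAMPLE_RATE = 44100          # assumed sample rate in Hz
K = 2 ** (1 / 12)            # assumed shift: one semitone up (100 cents)

def bin_of(freq_hz, n_samples):
    """Index of the FFT bin closest to freq_hz for a transform of n_samples."""
    return round(freq_hz * n_samples / SAMPLE_RATE)

def rounding_error_cents(i, K):
    """Detuning (in cents) caused by rounding the target index i * K."""
    ideal = i * K
    actual = round(ideal)
    return 1200 * math.log2(actual / ideal)

# A tone near 430 Hz, once in a 1024-sample chunk, once in a whole 5 min song
short_err = rounding_error_cents(bin_of(430, 1024), K)
long_err = rounding_error_cents(bin_of(430, 5 * 60 * SAMPLE_RATE), K)
print(f"1024 samples: {short_err:+.1f} cents, 5 min: {long_err:+.4f} cents")
```

Under these assumptions the short chunk is off by about 65 cents (more than half a semitone), while the error for the whole song stays far below one cent.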
Simple fix
If the new index \(i \cdot K\) falls between two integers \(j\) and \(j+1\), add the component to both array locations, with weights \((\frac{1}{2}, \frac{1}{2})\) when it lands exactly halfway, smoothly going to \((1, 0)\) as it approaches the integer index \(j\).
This is still bad! A single sine wave input is now spread over two close frequencies, which makes the output dissonant.
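A minimal sketch of this weighted split, assuming the weights change linearly with the fractional position (the text only says "smoothly", so the linear choice and the name `split_pitch_shift` are assumptions):

```python
import numpy as np

def split_pitch_shift(x, K):
    """Spread each component over the two integer indices around i * K."""
    X = np.fft.rfft(x)
    Y = np.zeros_like(X)
    for i, z in enumerate(X):
        pos = i * K
        j = int(np.floor(pos))
        frac = pos - j                      # 0 on an integer index, 0.5 halfway
        if j < len(Y):
            Y[j] += (1 - frac) * z          # weight for the lower neighbour
        if frac > 0 and j + 1 < len(Y):
            Y[j + 1] += frac * z            # weight for the upper neighbour
    return np.fft.irfft(Y, n=len(x))

# Usage: a tone on bin 10 of a 1024-sample chunk, shifted to position 15.5
N = 1024
n = np.arange(N)
y = split_pitch_shift(np.sin(2 * np.pi * 10 * n / N), 1.55)
```

In this example the output is the sum of two equally strong tones on bins 15 and 16. That sum equals the wanted tone at 15.5 multiplied by a slow cosine envelope, so the amplitude drops to zero in the middle of the chunk.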
More obstacles
By the way, the shifted chunks won't fit together at their boundaries anymore...
Fancier version, interesting failure
Actually, what is the perfect output of a sine wave input?
Perfect math
Let's say the input frequency is one of the FFT's frequencies. Then after the FFT, we get an array \((\dots,0,0,z,0,0,\dots)\) with a single nonzero entry.
Of course, the output should again be a sine wave, but one that does not hit an FFT frequency, so the FFT of the output will be a very messy array. With some patience, we can compute the components of this array to be a sinc function (more precisely its periodic variant, the Dirichlet kernel), with its peak at the location \(i \cdot K\) and slowly decaying tails everywhere else.
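The "messy array" can be checked numerically. The sketch below, assuming NumPy and a complex tone at a non-integer bin position \(p\), compares the FFT magnitudes with the closed form \(|\sin(\pi d) / \sin(\pi d / N)|\), where \(d = p - k\) is the distance from bin \(k\). The function names are made up for illustration.

```python
import numpy as np

def off_bin_spectrum(N, p):
    """FFT of a complex tone whose frequency sits at the (non-integer) bin position p."""
    n = np.arange(N)
    return np.fft.fft(np.exp(2j * np.pi * p * n / N))

def dirichlet_magnitude(N, p):
    """Closed-form magnitude of that FFT; only valid for non-integer p (else 0/0)."""
    d = p - np.arange(N)                    # distance of every bin from the true position
    return np.abs(np.sin(np.pi * d) / np.sin(np.pi * d / N))

# Usage: a tone halfway between bins 7 and 8 of a 64-point FFT
X = off_bin_spectrum(64, 7.5)
kernel = dirichlet_magnitude(64, 7.5)
```

Close to the peak this is nearly \(N\) times the ordinary sinc function; far away from it, the periodic form matters.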
So, to create a pitch shifter, we need to take each component in the FFT array, use it to scale a sinc function centred at \(i \cdot K\), sum it all up, and apply the inverse FFT. This could be done with a matrix. An approximation would just interpolate between more neighbours around the index \(i \cdot K\).
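The matrix version could look roughly like this. It is a sketch under assumptions: NumPy, signed bin indices (so that negative frequencies are shifted downwards), and kernel columns obtained numerically by FFT-ing the ideal shifted tones instead of writing out the closed form. `sinc_shift_matrix` and `matrix_pitch_shift` are hypothetical names.

```python
import numpy as np

def sinc_shift_matrix(N, K):
    """Column i holds the spectrum of the ideal shifted tone for input bin i."""
    n = np.arange(N)
    f = np.fft.fftfreq(N, d=1.0 / N)        # signed bin indices 0, 1, ..., -2, -1
    tones = np.exp(2j * np.pi * np.outer(n, K * f) / N)   # ideal outputs, time domain
    return np.fft.fft(tones, axis=0) / N    # their (sinc-like) spectra, one per column

def matrix_pitch_shift(x, K):
    """Scale each kernel column by its FFT component, sum up, transform back."""
    M = sinc_shift_matrix(len(x), K)
    return np.fft.ifft(M @ np.fft.fft(x))

# Usage: a complex tone on bin 5 of 64, shifted by K = 1.5 to the off-bin position 7.5
N = 64
n = np.arange(N)
y = matrix_pitch_shift(np.exp(2j * np.pi * 5 * n / N), 1.5)
```

For this single tone the result is a clean tone at position 7.5, even though that frequency is not one of the FFT's frequencies.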
But...
The horror
Yes, this perfectly solves the problem. All frequencies are shifted. But now, the song changes speed!
- Why?
Actually, our fancy method is just stretching the signal along the time axis (it is an interpolation, i.e. plain resampling).
- Wrong task definition? - Psychoacoustics!
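The stretching can be made visible with a tiny experiment. The sketch assumes NumPy and evaluates the "perfect" shift directly in the time domain: every FFT component drives a tone at \(K\) times its original (signed) frequency. The name `ideal_bin_shift` is made up for illustration.

```python
import numpy as np

def ideal_bin_shift(x, K):
    """Replace every FFT component by a tone at K times its (signed) frequency."""
    N = len(x)
    X = np.fft.fft(x)
    f = np.fft.fftfreq(N, d=1.0 / N)        # signed bin indices
    n = np.arange(N)
    tones = np.exp(2j * np.pi * np.outer(n, K * f) / N)
    return tones @ X / N

# Usage: any signal, shifted up by one octave (K = 2)
rng = np.random.default_rng(0)
x = rng.standard_normal(64)
y = ideal_bin_shift(x, 2.0)
double_speed = x[(2 * np.arange(64)) % 64]  # every second sample, wrapping around
```

For \(K = 2\) the result `y` is exactly `double_speed`: the original played twice as fast (and then repeated to fill the chunk). For non-integer \(K\) the same thing happens, with interpolated samples in between.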
Boring, pragmatic method
- Define a window/chunk size (let's say 1024 samples) that corresponds to the ear's time resolution.
- Split the signal into chunks, but with large overlap (50%).
- Perform the naive FFT-based pitch shift on each chunk individually.
- Glue the shifted chunks together smoothly.
This solves not only the problem of speed changes/time resolution, but also the problem of non-matching chunks.
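The four steps above can be sketched as follows. Several details are assumptions that the list leaves open: NumPy, rounding to the nearest index for the naive shift, and a periodic Hann window used as the crossfade that glues the chunks together. The function names are hypothetical.

```python
import numpy as np

def shift_chunk(chunk, K):
    """Naive FFT pitch shift of one chunk: move bin i to round(i * K)."""
    X = np.fft.rfft(chunk)
    Y = np.zeros_like(X)
    for i, z in enumerate(X):
        j = int(round(i * K))
        if j < len(Y):                      # components pushed past Nyquist are dropped
            Y[j] += z                       # colliding components are summed
    return np.fft.irfft(Y, n=len(chunk))

def pitch_shift_chunked(x, K, size=1024):
    """Shift overlapping chunks one by one and crossfade them back together."""
    hop = size // 2                         # 50% overlap
    n = np.arange(size)
    fade = 0.5 - 0.5 * np.cos(2 * np.pi * n / size)   # fade[n] + fade[n + hop] == 1
    y = np.zeros(len(x))
    for start in range(0, len(x) - size + 1, hop):
        y[start:start + size] += fade * shift_chunk(x[start:start + size], K)
    return y

# Usage: a tone on bin 8 of each chunk, shifted up by one octave
t = np.arange(8192)
y = pitch_shift_chunked(np.sin(2 * np.pi * 8 * t / 1024), 2.0)
```

The output has the same length as the input, so the speed is untouched. The first and last half chunk are only covered by one fading chunk and therefore come out attenuated; a real implementation would pad the signal at both ends.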
\(-_-)/