I've started working on rewriting the core synthesis routine.

In the past, I've avoided using the IFFT to generate waveforms, mostly in the hopes that by controlling the phase at the sample level, I'd be able to have clean joins between waveforms.

However, I'm still having to crossfade waves together, so that doesn't seem to be a compelling reason anymore.

I'd also been a bit leery of losing frequency resolution. However, that's not really an issue with a sufficiently large sample size.

Rendering with the IFFT is fairly straightforward, although there's some overhead. But that cost is small compared to creating a separate oscillator or filter for every harmonic, which is how I've been doing it.

The synthesis process is:

  1. Generate a single glottal pulse at the desired frequency.
  2. Repeatedly copy the pulse into an FFT buffer.
  3. Perform an FFT on the duplicated pulse.
  4. At each bin, multiply the real and imaginary values by the spectral curve's amplitude at that frequency.
  5. Perform an IFFT on the modified data.

The "real" portion returned by the IFFT contains the waveform.

A nice feature of using the IFFT is that the phases from the glottal pulse are automatically carried into the final waveform, so crossfading produces smooth connections.
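
For completeness, here's the kind of crossfade I mean, in its simplest linear form (again, a sketch rather than the actual synSinger code):

    import numpy as np

    def crossfade(a, b, overlap):
        # Fade out the tail of `a` while fading in the head of `b`
        # over `overlap` samples, then splice the two together.
        t = np.linspace(0.0, 1.0, overlap)
        mixed = a[-overlap:] * (1.0 - t) + b[:overlap] * t
        return np.concatenate([a[:-overlap], mixed, b[overlap:]])

A linear fade works well here precisely because the IFFT preserves the pulse phases, so the overlapping waves are already well correlated.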

I've been testing this out with synthetic vowel data from prior versions of synSinger, and it seems to work pretty well.

However, when tested on real data, the pulse seems to act like a low-pass filter. That is, not much high-frequency information is being rendered.

That suggests an issue either in the harmonic analysis of the glottal pulse, or in the glottal pulse itself.
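
One way to tell the two apart is to measure the per-harmonic magnitudes of the pulse and the rendered output directly, and see where the roll-off enters the chain. A hypothetical helper along these lines, assuming an integer number of samples per pitch period:

    import numpy as np

    def harmonic_levels(wave, period):
        # FFT over a whole number of pulse periods, then read off the
        # bins at integer multiples of the fundamental.
        n = (len(wave) // period) * period
        mags = np.abs(np.fft.rfft(wave[:n]))
        k = n // period                    # bin index of the fundamental
        return mags[k::k]

Comparing `20 * np.log10(harmonic_levels(...))` for the raw pulse against the rendered output should show whether the high end is missing from the pulse itself or being lost later.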

I've tried inverting the pulse, but that wasn't much of an improvement. I'm also seeing lots of waves being reconstructed with some of their harmonics way out of proportion to the others.

I've got the feeling this is going to be another long, hard slog.
