I was reading through the details of how Praat implemented their glottal pulse, and noticed that not only does Praat have a simple algorithm for generating a glottal pulse, but it's got code for generating the derivative as well.
So I've replaced the glottal pulse I had been using with the derivative glottal flow, and the results are pretty good.
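For reference, as I understand it Praat builds each pulse from a simple polynomial flow, roughly x^power1 - x^power2 over the open phase (power1 = 3 and power2 = 4 by default), which means the derivative is available in closed form. Here's a minimal sketch of just that polynomial part, using my own function names and ignoring Praat's collision-phase decay:

```python
import numpy as np

def glottal_flow(x, power1=3.0, power2=4.0):
    # Polynomial flow term: 0 at x = 0, peaks at x = power1/power2, back to 0 at x = 1.
    return x**power1 - x**power2

def glottal_flow_derivative(x, power1=3.0, power2=4.0):
    # Analytic derivative of the flow; the steep negative swing near the end
    # of the open phase is what actually drives the vocal tract filters.
    return power1 * x**(power1 - 1.0) - power2 * x**(power2 - 1.0)

# One pulse sampled across its open phase (x = time / open-phase length).
x = np.linspace(0.0, 1.0, 256)
flow = glottal_flow(x)
dflow = glottal_flow_derivative(x)
```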
However, when I changed over to another vocal sample, I noticed there was a noise burst at the beginning of each vowel.
At first, I thought that perhaps I was overloading the bandpass filters, so I added some logic to slowly bring the pulse amplitude up over the course of several pulses. But I was still getting the burst.
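The ramp-up logic is nothing fancy, roughly this shape (the names and the pulse count are just placeholders, not the actual synSinger code):

```python
def onset_gain(pulse_index, ramp_pulses=4):
    # Scale the first few pulses up linearly so the filters aren't hit with a
    # full-amplitude pulse right at voicing onset.  ramp_pulses is arbitrary.
    if pulse_index >= ramp_pulses:
        return 1.0
    return (pulse_index + 1) / ramp_pulses
```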
So I disabled the bandpass filters, and it was there at the front of the pulse train:
So I had a closer look at the beginning, and sure enough - the code decided that the very beginning of the wave wasn't voiced, so it was passing those frames to the code that renders unvoiced frames:
Praat, by contrast, appears to properly identify the start of the waveform:
The most probable culprit is that my code simply isn't detecting the frames that fall within the pulses.
Another possibility is that the FFT window is looking too far forward, "leaking" spectral data from frames beyond the 5ms it's currently analyzing. But the wave amplitudes don't really support that explanation.
So it's likely to simply be a coding error.
But it's got me thinking of "mixed excitation".
synSinger is a "classical" vocoder. The voicing decision is binary: either voicing is on, or off.
If it's on, a glottal pulse is fed through the filters.
If it's off, noise is fed through the filters.
However, this binary decision is known to contribute to the resulting speech sounding synthetic.
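In code, the classical approach boils down to something like this (frame_excitation and pulse_frame are hypothetical names, not what synSinger actually calls them):

```python
import numpy as np

def frame_excitation(voiced, pulse_frame, noise_gain=1.0):
    # Classical binary decision: the excitation for a frame is either the
    # glottal pulse train or white noise, never a blend of the two.
    if voiced:
        return pulse_frame
    return noise_gain * np.random.standard_normal(len(pulse_frame))
```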
Back in the early '90s, MELP (mixed excitation linear prediction) was introduced to help create more natural speech.
In a nutshell, MELP split the spectrum into frequency bands, classified each band as "voiced" or "unvoiced", and used those classifications to decide how much energy to give the pulse train versus the white noise in each band.
Additionally, it detected when voicing was aperiodic by looking at the voicing strength and, when it was, varied the pitch period by up to 25%. This helped create a more natural sound.
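My paraphrase of that idea, as a sketch rather than the actual MELP spec (band_filters is an assumed list of bandpass callables, and voicing_strengths a matching list of 0..1 weights):

```python
import numpy as np

def mixed_excitation(pulse, noise, band_filters, voicing_strengths):
    # MELP-style mix: filter the pulse train and the noise into the same set
    # of bands, then weight each band by how voiced that band appears to be.
    out = np.zeros_like(pulse)
    for bandpass, strength in zip(band_filters, voicing_strengths):
        out += strength * bandpass(pulse) + (1.0 - strength) * bandpass(noise)
    return out

def jittered_period(period, aperiodic, max_jitter=0.25):
    # For frames flagged as aperiodic, perturb the pitch period by up to 25%
    # so the pulse train doesn't sound mechanically regular.
    if not aperiodic:
        return period
    return period * (1.0 + np.random.uniform(-max_jitter, max_jitter))
```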
I'm already storing amplitudes in Mel bands, so it will be interesting to see how well mixed-mode excitation can be detected by looking at those bands.
This would mostly be useful for phonemes that are the voiced counterparts of unvoiced fricatives, such as /V/ and /ZH/, which mix periodic voicing with frication noise.
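For comparison, the usual way MELP-style coders measure per-band voicing is a normalized autocorrelation of each band-filtered signal at the pitch lag; whether my stored Mel amplitudes can approximate that remains to be seen. A sketch of that standard measure (band_signal and pitch_lag are assumed inputs):

```python
import numpy as np

def band_voicing_strength(band_signal, pitch_lag):
    # Normalized autocorrelation at the pitch lag: near 1.0 for a strongly
    # periodic (voiced) band, near 0.0 for a noise-dominated band.
    x = band_signal[:-pitch_lag]
    y = band_signal[pitch_lag:]
    denom = np.sqrt(np.dot(x, x) * np.dot(y, y)) + 1e-12
    return float(max(0.0, np.dot(x, y) / denom))
```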
But the first thing I should do is add the pulse information to the display, so I can see where the pulses are to debug this problem.