I've spent the past month working on a portable UI, but most of the work ahead involves very little of that.
Rather, I intend to spend my time on the vocal synthesis code. Here's a partial list of things I need to work on, in no particular order:
New Reclist
The prior version of synSinger used a reclist built from abstract phonemes. It was designed to give good coverage, but I suspect it made words sound less natural. Additionally, it didn't sample diphthongs (although the vowel transitions were sampled).
I'm going to try using a word-based reclist and see if the results are more natural.
Higher Resolution Sampling
The output of the synthesis - especially the voiced consonants - sounds very "lo-fi". I'm going to experiment with higher-resolution sampling in both time and frequency to see if the results improve the synthesis quality.
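To give a sense of the kind of experiment I mean, here's a minimal sketch using scipy's STFT. The window and hop sizes are arbitrary placeholders, not synSinger's actual analysis settings:

```python
import numpy as np
from scipy.signal import stft

fs = 44100
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 220 * t)    # stand-in for a recorded sample

# Short window: good time resolution, blurry frequency resolution.
f_lo, t_lo, S_lo = stft(x, fs=fs, nperseg=512, noverlap=384)

# Long window: sharper frequency resolution, smeared transients.
f_hi, t_hi, S_hi = stft(x, fs=fs, nperseg=4096, noverlap=3584)

print(f"short window: {len(f_lo)} bins x {len(t_lo)} frames")
print(f"long window:  {len(f_hi)} bins x {len(t_hi)} frames")
```

The usual tradeoff applies: a longer window sharpens frequency resolution but smears transients, which matters most for the consonants.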
Synthesis of Missing Phoneme Transitions
Despite the prior version of synSinger having very good coverage, there are still gaps. There's a time and effort cost to capturing every single phoneme transition, and I'm not sure I want to pay that cost.
In lieu of a fully comprehensive list, I'd like to have a reasonably comprehensive reclist of phoneme transitions. When a missing phoneme transition is encountered, synSinger will need to synthesize it.
That's not a huge stretch. After all, synSinger already re-synthesizes the phoneme transitions and morphs them to smooth the connections. The difference is that it's essentially cross-fading across spectral envelopes. Creating the missing transitions would require synthesizing all the missing data by interpolating between two points.
And prior versions of synSinger did exactly that. Phonemes were spectral envelopes at a given point in time, and synSinger created all the sustained sounds and transitions from these static spectral envelopes.
This is also exactly what early formant synthesis did: it described phonemes as target formant frequencies and interpolated from one set of formants to the next. The result may not have been entirely convincing, but it was very intelligible.
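As a rough illustration of that sort of interpolation, here's a sketch of bin-by-bin cross-fading between two static spectral envelopes. The function name, envelope size, and frame count are mine, for illustration only:

```python
import numpy as np

def interpolate_envelopes(env_a, env_b, n_frames):
    """Build a transition by linearly cross-fading between two
    static spectral envelopes (one amplitude per frequency bin)."""
    alphas = np.linspace(0.0, 1.0, n_frames)
    return np.stack([(1 - a) * env_a + a * env_b for a in alphas])

# e.g. a 20-frame transition between two made-up envelopes:
env_a = np.random.rand(256)        # stand-ins for measured envelopes
env_b = np.random.rand(256)
frames = interpolate_envelopes(env_a, env_b, 20)   # shape (20, 256)
```

The catch is that straight bin-by-bin cross-fading won't actually slide a formant to its new position - it just fades one peak out while fading the other in.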
Since the reclist contains the most frequent phoneme transitions, the missing phoneme transitions should be few and far between.
In order to provide the most realistic morphs - so that formants shift to new positions - it's necessary to know the position of formants in the source and destination spectral curves.
However, I'll probably be able to approximate it by creating a lookup table of formant positions per phoneme for a given singer. If I need to, I can add a simple peak picker to get the best match. That would be a lot simpler than determining the formant positions at every transition.
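To make that concrete, here's a sketch of the lookup-table-plus-peak-picker idea. The phoneme labels and formant frequencies below are illustrative placeholder values, not measurements from any singer:

```python
import numpy as np

# Hypothetical per-singer lookup table: expected formant positions in Hz.
FORMANT_TABLE = {
    "AA": (730.0, 1090.0, 2440.0),
    "IY": (270.0, 2290.0, 3010.0),
}

def refine_formant(envelope, freqs, expected_hz, search_hz=150.0):
    """Snap a table-supplied formant estimate to the strongest bin in
    the measured envelope within +/- search_hz of the estimate."""
    in_range = np.where(np.abs(freqs - expected_hz) <= search_hz)[0]
    if len(in_range) == 0:
        return expected_hz             # nothing nearby; trust the table
    return freqs[in_range[np.argmax(envelope[in_range])]]

# e.g. snap the table's F1 for "AA" to a fake envelope peaked near 760 Hz:
freqs = np.linspace(0, 8000, 512)
envelope = np.exp(-((freqs - 760.0) / 80.0) ** 2)
f1 = refine_formant(envelope, freqs, FORMANT_TABLE["AA"][0])
```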
Better Aspiration Synthesis
Where to begin with this? There's no way around it: synSinger's aspiration synthesis is harsh and nasty-sounding.
Currently, synSinger generates the accompanying aspirated sounds with the same spectral curve it uses to generate voiced sounds. And although it should - based on the literature I've read - sound like an approximation of whispered speech, it instead sounds bad.
I can't pretend to explain why it doesn't work, or I'd have fixed it. This has been an outstanding problem with synSinger from the beginning. The only solution I can think of is more research and spending time looking closely at spectrograms.
Aspiration has always been a problem with synSinger, and I'd like to get it right this time.
Better Harmonic Synthesis
synSinger is basically a vocoder that uses glottal pulses as a carrier wave. The filters in the filter bank move to match the frequencies of the harmonics.
This is a fairly straightforward process, but the results are rather low-quality.
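For reference, the basic scheme looks something like this. It's a simplified sketch rather than synSinger's actual code - second-order Butterworth band-pass filters stand in for whatever the real filter bank uses, and the bands here are fixed rather than tracking a moving pitch:

```python
import numpy as np
from scipy.signal import butter, sosfilt

def harmonic_filter_bank(carrier, fs, f0, amplitudes):
    """Band-pass the carrier at each harmonic of f0, weighting each
    band by the spectral envelope's amplitude at that harmonic."""
    out = np.zeros(len(carrier))
    for k, amp in enumerate(amplitudes, start=1):
        center = k * f0
        lo, hi = center - f0 / 2, center + f0 / 2
        if hi >= fs / 2:               # band would reach Nyquist; stop
            break
        sos = butter(2, [lo, hi], btype="bandpass", fs=fs, output="sos")
        out += amp * sosfilt(sos, carrier)
    return out

# e.g. a crude pulse-train carrier shaped by a flat 30-harmonic envelope:
fs, f0 = 44100, 220.0
n = np.arange(fs // 10)
carrier = (n % int(fs / f0) == 0).astype(float)
voiced = harmonic_filter_bank(carrier, fs, f0, np.ones(30))
```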
There are a number of things that I'm considering changing.
As mentioned before, higher resolution in the time and frequency domains may help resolve this.
But I'm thinking of taking a different approach to synthesis.
If I simply replace the filters with sine wave synthesis, the result is harsh and ringy. This may indicate that there's a problem with the sampling.
But instead of band-pass filtering the glottal pulse, I could decompose the pulse to determine the amplitude of the constituent harmonics, as well as their phases.
I could then generate the pulse directly from those sine waves, with the correct phase information.
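Here's a sketch of that decompose-and-resynthesize idea, treating one pitch period of the glottal pulse as an FFT frame. Again, this is illustrative code rather than synSinger's:

```python
import numpy as np

def decompose_pulse(pulse):
    """Split one period of a glottal pulse into per-harmonic
    amplitudes and phases via the FFT."""
    spectrum = np.fft.rfft(pulse)
    return np.abs(spectrum), np.angle(spectrum)

def resynthesize(amplitudes, phases, n_samples):
    """Rebuild the period additively as a sum of sinusoids,
    keeping the measured phases."""
    n = np.arange(n_samples)
    out = np.full(n_samples, amplitudes[0] * np.cos(phases[0]))  # DC bin
    for k in range(1, len(amplitudes)):
        # every bin is doubled except the Nyquist bin (even lengths)
        scale = 1.0 if (n_samples % 2 == 0 and k == n_samples // 2) else 2.0
        out = out + scale * amplitudes[k] * np.cos(2 * np.pi * k * n / n_samples + phases[k])
    return out / n_samples

# e.g. round-trip a toy decaying pulse; rebuilt matches to float precision:
period = 200
pulse = np.exp(-np.arange(period) / 20.0)
amps, phs = decompose_pulse(pulse)
rebuilt = resynthesize(amps, phs, period)
```

The point of the split is that the amplitudes could then be reshaped by the spectral envelope while the measured phases are kept.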
Additionally, if I ignored the amplitude information of the harmonics in the glottal pulse, the result might in theory be a better match to the original voice timbre. The problem is that the timbre should change as the pitch changes, so that might not be a good approach.
But using the phase information from the glottal pulse may give better results with less audio quality loss than I'm currently getting.
I'm sure there's more that needs to be looked at, but these are the items at the top of my list. Basically, I'd like higher-quality output and more control over the synthesis parameters.