To recap: this summer, I decided to revisit the resynthesis routines that synSinger was using in hopes of creating better output. I ended up getting fairly good results with one method, only to have it fail badly when I started testing it with other data. Sinusoidal synthesis still gives excellent results, but I need the flexibility to modify vocal attributes such as tension and the glottal pulse - something that sinusoidal synthesis doesn't automatically supply.

Additionally, I'm unhappy with how robotic the output is, and much of that seems to stem from the method of recording the phonemes.

While I've been mulling these issues over, I've been doing some research into neural networks. There's a lot of successful research into TTS (Text to Speech) synthesis, but singing synthesis is a bit of a different animal.

Many of the features that are desirable in TTS are unwanted in singing synthesis.

For example, the prosody of speech - the pitch line, emphasis, and phoneme duration - is automatically baked into TTS.

In singing synthesis, these are manually specified.
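To make that concrete, here's a hypothetical sketch - not synSinger's actual input format - of a score entry for a singing synthesizer, where the pitch and duration that TTS would otherwise predict are supplied explicitly by the user:

```python
from dataclasses import dataclass

@dataclass
class NoteEvent:
    phonemes: list      # e.g. ["HH", "EH"] - spelled out by the user
    midi_pitch: int     # pitch comes from the score, not from a prosody model
    duration_sec: float # duration comes from the score as well

# Every attribute a TTS system would normally predict (the prosody)
# is specified up front; the synthesizer only has to render it.
score = [
    NoteEvent(phonemes=["HH", "EH"], midi_pitch=60, duration_sec=0.40),
    NoteEvent(phonemes=["L", "OW"],  midi_pitch=64, duration_sec=0.80),
]
```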

One solution - adopted by Sinsy - is to have TTS generate the speech first, and then pitch- and time-shift the result to match the pitch and timing constraints of the song.
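As a rough illustration of that idea - not Sinsy's actual pipeline - here's a sketch that uses librosa's stock effects to force a spoken phrase onto a target pitch and duration; the function name and parameters are my own:

```python
import numpy as np
import librosa

def warp_to_note(y, sr, spoken_f0_hz, target_f0_hz, target_dur_sec):
    # Shift pitch by the interval (in semitones) between the spoken
    # pitch and the note the score calls for.
    n_steps = 12.0 * np.log2(target_f0_hz / spoken_f0_hz)
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)

    # Stretch or compress the result to the note's duration.
    current_dur = len(y) / sr
    y = librosa.effects.time_stretch(y, rate=current_dur / target_dur_sec)
    return y
```

In practice the warping would be done per phoneme or per note rather than over a whole phrase, but the principle is the same.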

In some ways, the constraints of singing synthesis are simpler than speech synthesis. For example, since the user needs the ability to specify down to the phoneme level, there's no need for a language model to be created, as a phonetic dictionary will do. It's up to the end user to handle homographs - words that are written the same, but sound different.
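As a toy example - with made-up ARPAbet-style entries, not an actual lexicon format - a phonetic dictionary is just a lookup table, and a homograph gets resolved by letting the user spell out the phonemes directly:

```python
# Minimal phonetic lexicon; a real system would load something like CMUdict.
LEXICON = {
    "the":  ["DH", "AH"],
    "read": ["R", "IY", "D"],   # default pronunciation
}

def to_phonemes(token):
    # A token wrapped in slashes is treated as a user-supplied
    # pronunciation, e.g. "/R EH D/" for the past tense of "read".
    if token.startswith("/") and token.endswith("/"):
        return token.strip("/").split()
    return LEXICON[token.lower()]

lyric = ["the", "/R EH D/"]
print([to_phonemes(t) for t in lyric])   # [['DH', 'AH'], ['R', 'EH', 'D']]
```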

Concatenation synthesis programs like Vocaloid and UTAU typically have a large set of pre-recorded phoneme pairs that can be assembled to create singing.

This requires a lot of recordings. Because errors can creep into the process, it's helpful to have multiple recordings of each phoneme. The non-AI version of SynthesizerV has three different takes of each phoneme to choose from, which can be very useful when the primary recording is flawed.
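Here's a rough sketch of the concatenative idea - the unit names and storage layout are hypothetical: each diphone is looked up from its recorded takes, and the joins are hidden with a short crossfade:

```python
import numpy as np

def render(diphones, units, sr, fade_ms=20):
    # Concatenate pre-recorded diphone units with a short crossfade.
    fade = int(sr * fade_ms / 1000)
    ramp = np.linspace(0.0, 1.0, fade)
    out = np.zeros(0)
    for name in diphones:
        takes = units[name]   # e.g. units["a-t"] -> [take1, take2, take3]
        y = takes[0]          # primary take; swap in an alternate if it's bad
        if len(out) >= fade:
            # Blend the end of what we have with the start of the new unit.
            out[-fade:] = out[-fade:] * (1.0 - ramp) + y[:fade] * ramp
            y = y[fade:]
        out = np.concatenate([out, y])
    return out
```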

The promise of neural networks is that they can "learn" from examples, and so can potentially handle atypical phoneme pairings more robustly. But that isn't guaranteed, by any means.

Training a neural network to recognize speech isn't the same as training it to generate speech, but both are generally required: generating speech takes a lot of training data, and that training speech generally has to be labeled automatically - which is itself a recognition task.

And while neural-network-generated speech can often be very good, given the constraints of singing synthesis it's not necessarily better than concatenative synthesis.

I've been playing around with the idea of using a neural network to do singing synthesis for some time, and have recently started looking at it more seriously.

There are still a lot of questions I need to answer, such as how to handle time in a controlled manner, and whether the output would actually be better.

In the meantime, I'll continue to see if I can get better synthesis results.

