The most basic type of phoneme combination is probably vowel/consonant. Here's a screenshot from Praat of me saying /S AH S AH S/:

"S AH S AH S"

Here, I've used the letter x to represent silence. There are a total of 8 segments here, 6 of which are transitions:

  1. x_s is a transition from silence to /S/
  2. s_ah is a transition from the phoneme /S/ to /AH/
  3. ah is sustained and stable, so it's not a transition
  4. ah_s is a transition from the phoneme /AH/ to /S/
  5. s_ah is a transition from the phoneme /S/ to /AH/
  6. ah is sustained and stable, so it's not a transition
  7. ah_s is a transition from the phoneme /AH/ to /S/
  8. s_x is a transition from /S/ to silence

So it looks like there's some redundancy here. But for the sake of naturalness, we can consider the initial transition x_s_ah and the final transition ah_s_x different from those in the center, which are faster, and treat each of them as a single chunk:

"S AH S AH S"

This is more typically how CV (consonant/vowel) and VC (vowel/consonant) transitions are split up.
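To make the two ways of splitting concrete, here's a rough Python sketch. It is not synSinger's actual code: the function names are hypothetical, and I'm assuming unit names are just lowercase phoneme names joined with underscores, with x standing for silence as above. It only covers the simple alternating consonant/vowel case shown in the screenshots.

```python
SILENCE = "x"  # same convention as in the text: x stands for silence


def pairwise_transitions(phonemes):
    """Every adjacent pair, with silence added at both ends."""
    seq = [SILENCE] + [p.lower() for p in phonemes] + [SILENCE]
    return [f"{a}_{b}" for a, b in zip(seq, seq[1:])]


def chunked_units(phonemes):
    """Initial and final transitions become three-part chunks;
    the faster transitions in the center stay as pairs."""
    seq = [p.lower() for p in phonemes]
    if len(seq) < 2:
        # Too short to split: one unit bounded by silence.
        return ["_".join([SILENCE] + seq + [SILENCE])]
    first = "_".join([SILENCE, seq[0], seq[1]])
    last = "_".join([seq[-2], seq[-1], SILENCE])
    inner = seq[1:-1]
    middle = [f"{a}_{b}" for a, b in zip(inner, inner[1:])]
    return [first] + middle + [last]


word = ["S", "AH", "S", "AH", "S"]
print(pairwise_transitions(word))
print(chunked_units(word))
```

The pairwise version shows the redundancy: s_ah and ah_s each turn up twice, so only four of the six transitions are distinct.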

Looking at the /S/ sounds, it's clear that, like the vowel, they're sustained and stable, so you could just do a crossfade to combine one smoothly with other /S/ sounds.

This is the case with other fricatives and nasals, such as /CH, DH, F, H, N, M, S, Z/.
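For illustration, a crossfade between two sustained segments might look something like the sketch below. This is a plain linear crossfade over lists of samples. I'm assuming that's good enough for stable sounds; the function name and the choice of a linear ramp are mine, not something taken from synSinger.

```python
def crossfade(a, b, overlap):
    """Join sample lists a and b, blending the last `overlap`
    samples of a into the first `overlap` samples of b."""
    if overlap <= 0 or overlap > min(len(a), len(b)):
        raise ValueError("overlap must be between 1 and the shorter length")
    head = a[:-overlap]   # part of a that is kept untouched
    tail = b[overlap:]    # part of b that is kept untouched
    start = len(a) - overlap
    mixed = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight of b ramps up linearly
        mixed.append(a[start + i] * (1.0 - w) + b[i] * w)
    return head + mixed + tail


joined = crossfade([1.0] * 4, [0.0] * 4, overlap=2)
print(len(joined))
```

The result is shorter than the two inputs put end to end by `overlap` samples, since the blended region is shared.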

The next set of consonants to consider are the stop consonants, /B, CH, D, G, JH, K, P, T/. Here's an example of "DERDERD":

"DERDERD"

With "stop" consonants, there's a short stop before the sounding of the consonant. I've chosen to place the split right at the stop, which means that stops of different sorts can be combined. For example, er_d followed by j_ah would not sound the /D/ at all, since the sounding of the stop consonant is placed on the right side of the split.

Strictly speaking, you'd want to combine the voiced stops /B, D, G, JH/ with other voiced stops, and the unvoiced stops /CH, K, P, T/ with other unvoiced stops, as there's a quiet but audible voice bar prior to a voiced stop.
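Here's a small sketch of how that voicing rule could be checked before joining two units at a stop. The two sets come straight from the lists above. Everything else (the function names, and the assumption that unit names are lowercase phonemes joined by underscores) is hypothetical.

```python
VOICED_STOPS = {"b", "d", "g", "jh"}
UNVOICED_STOPS = {"ch", "k", "p", "t"}


def same_voicing(stop_a, stop_b):
    """True if both stops are voiced, or both are unvoiced."""
    a, b = stop_a.lower(), stop_b.lower()
    for group in (VOICED_STOPS, UNVOICED_STOPS):
        if a in group and b in group:
            return True
    return False


def can_join_at_stop(left_unit, right_unit):
    """Compare the stop that ends left_unit (e.g. "er_d") with
    the stop that begins right_unit (e.g. "jh_ah")."""
    left_stop = left_unit.split("_")[-1]
    right_stop = right_unit.split("_")[0]
    return same_voicing(left_stop, right_stop)


print(can_join_at_stop("er_d", "jh_ah"))
print(can_join_at_stop("er_d", "t_ah"))
```

The first pair matches, since /D/ and /JH/ are both voiced. The second doesn't, because the voice bar at the end of er_d would run into an unvoiced /T/.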

That leaves the liquid consonants, /L, R, W, Y/. What distinguishes these sounds is the movement they make while being sounded, so they lack a stable portion:

"LERLERL"

The division point for these sounds is prior to the phoneme.

So how to handle all the CC (consonant to consonant) transitions?

They have to be recorded separately. Currently, synSinger has a bunch of prefix and postfix strings of consonants. Here's an example of /S P R/:

"SPR"

Since the /R/ is a liquid consonant, only the initial portion is captured in the transition. This can then be prepended to an r_V transition (where V is the target vowel) to create a smooth transition.

Here's a short list of example prefixes:

-s_k_y -k_v -sh_t -s_p_y -h_l -f_w -th_w -zh_w -s_k_l -l_w -m_l
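As a sketch of how a prefix gets paired with the transition that follows it, the snippet below looks up a recorded prefix and appends the C_V transition for its last consonant. The set holds only the short list of examples above, not the full inventory, and the function name and error handling are my own assumptions for illustration.

```python
# Only the example prefixes listed above; the real inventory is larger.
RECORDED_PREFIXES = {
    "-s_k_y", "-k_v", "-sh_t", "-s_p_y", "-h_l", "-f_w",
    "-th_w", "-zh_w", "-s_k_l", "-l_w", "-m_l",
}


def onset_units(consonants, vowel):
    """Return the prefix unit for a consonant cluster, followed by
    the transition from its last consonant into the vowel."""
    names = [c.lower() for c in consonants]
    prefix = "-" + "_".join(names)
    if prefix not in RECORDED_PREFIXES:
        raise KeyError(f"no recorded prefix {prefix}")
    return [prefix, f"{names[-1]}_{vowel.lower()}"]


print(onset_units(["S", "K", "L"], "AH"))
```

For /S K L/ followed by /AH/, this gives the -s_k_l prefix followed by l_ah, with the /L/ shared between the two units.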

There's a similar list of postfix consonant clusters, as well as assorted singleton consonants such as +p or +k- that can be attached. They don't sound very good, but they're better than the synthesizer rendering nothing.

I've been playing with the idea of perhaps taking a more clever approach, but it's probably better to just stick with this process for now.

