The Inference Pipeline — Converting New Text to Speech
Let’s now see at a high level how Deep Voice takes a simple sentence and converts it into audio that we can hear.
The pipeline, as we’ll see, has the following architecture:
Let’s now go through this pipeline step by step to get an understanding of what these pieces are and how they fit together. In particular, we’ll trace the following phrase and see how it is processed by Deep Voice:
It was early spring.
Step 1: Convert Graphemes (Text) to Phonemes
Languages such as English are peculiar in that they aren’t phonetic. For instance, take the following words (adapted from here) that all use the suffix “ough”:
1. though (like o in go)
2. through (like oo in too)
3. cough (like off in offer)
4. rough (like uff in suffer)
Notice how they all have fairly different pronunciations even though they have the same spelling. If our TTS system used spelling as its main input, it would inevitably run into problems trying to reconcile why “though” and “rough” should be pronounced so differently, even though they have the same suffix. As such, we need a slightly different representation of words, one that reveals more information about their pronunciation.
This is exactly what phonemes are. Phonemes are the different units of sound that we make. Combining them together, we can recreate the pronunciation for almost any word. Here are a few examples of words broken into phonemes (adapted from CMU’s phoneme dictionary):
- White Room — [W, AY1, T, ., R, UW1, M, .]
- Crossroads — [K, R, AO1, S, R, OW2, D, Z, .]
The numbers next to the vowel phonemes mark stress: 1 is primary stress, 2 is secondary stress, and 0 means the vowel is unstressed. Additionally, periods represent empty space (silence) in the pronunciation.
So, the first step in Deep Voice will be to convert every sentence into its phoneme representation using a phoneme dictionary like this one.
Our Sentence
So, for our first step, Deep Voice will have the following inputs and outputs.
- Input - “It was early spring”
- Output - [IH1, T, ., W, AA1, Z, ., ER1, L, IY0, ., S, P, R, IH1, NG, .]
We’ll cover how we train such a model in the next blog post.
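To make this step concrete, here’s a minimal, self-contained sketch of the lookup in Python. The mini-dictionary and the `graphemes_to_phonemes` helper are illustrative stand-ins invented for this example; a real system would use the full CMU pronunciation dictionary (and a learned grapheme-to-phoneme model for out-of-vocabulary words).

```python
# Minimal sketch of Step 1: grapheme-to-phoneme conversion via dictionary lookup.
# CMU_MINI is a hand-written stand-in for the full CMU pronunciation dictionary.
CMU_MINI = {
    "it":     ["IH1", "T"],
    "was":    ["W", "AA1", "Z"],
    "early":  ["ER1", "L", "IY0"],
    "spring": ["S", "P", "R", "IH1", "NG"],
}

def graphemes_to_phonemes(text):
    """Convert a sentence into a flat phoneme sequence, with '.' between words."""
    phonemes = []
    for word in text.lower().strip(".").split():
        phonemes.extend(CMU_MINI[word])  # a KeyError here means we'd need a learned G2P model
        phonemes.append(".")             # '.' marks the empty space between words
    return phonemes

print(graphemes_to_phonemes("It was early spring."))
# ['IH1', 'T', '.', 'W', 'AA1', 'Z', '.', 'ER1', 'L', 'IY0', '.', 'S', 'P', 'R', 'IH1', 'NG', '.']
```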
Step 2, Part 1: Duration Prediction
Now that we have the phonemes, we need to estimate how long each of these phonemes should be held while speaking. This is again an interesting problem, as the same phoneme should be held for a longer or shorter duration depending on its context. Take the following examples surrounding the phonemes “AH N”:
Clearly, “AH N” needs to be held out far longer in the first case than in the second, and we can train a system to do just that. In particular, we’ll take each phoneme and predict how long we’ll hold it (in seconds).
Our Sentence
Here’s what will happen to our example sentence at this step:
- Input - [IH1, T, ., W, AA1, Z, ., ER1, L, IY0, ., S, P, R, IH1, NG, .]
- Output - [IH1 (0.5s), T (0.1s), . (0.2s), … ]
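As a rough sketch of this stage’s interface, here’s a toy Python version. The per-phoneme durations are placeholders chosen to match the example output above, and they ignore the surrounding context entirely; in Deep Voice, a trained network predicts each duration from the phoneme and its neighbours.

```python
# Toy sketch of the duration-prediction interface. The numbers are placeholders;
# a real model predicts each duration from the phoneme's context.
AVG_DURATION_S = {"IH1": 0.5, "T": 0.1, ".": 0.2}  # hypothetical per-phoneme averages

def predict_durations(phonemes):
    """Attach a duration (in seconds) to each phoneme."""
    return [(p, AVG_DURATION_S.get(p, 0.08)) for p in phonemes]  # 0.08s fallback

print(predict_durations(["IH1", "T", "."]))
# [('IH1', 0.5), ('T', 0.1), ('.', 0.2)]
```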
Step 2, Part 2: Fundamental Frequency Prediction
We’ll also want to predict the tone and intonation of each phoneme to make it sound as human as possible. This is especially important in languages like Mandarin, where the same sound can have an entirely different meaning based on its tone and accent. Predicting the fundamental frequency of each phoneme helps us do just this. The frequency tells the system the approximate pitch or tone at which the phoneme should be pronounced.
Additionally, some phonemes aren’t meant to be voiced at all. This means that they are pronounced without any vibrations of vocal cords.
As an example, say the sounds “ssss” and “zzzz” and notice how the former causes no vibrations in your vocal cords (it is unvoiced) while the latter does (it is voiced).
Our fundamental frequency prediction will also take this into account and predict when a phoneme should be voiced and when it should not.
Our Sentence
Here’s what will happen to our example sentence at this step:
- Input - [IH1, T, ., W, AA1, Z, ., ER1, L, IY0, ., S, P, R, IH1, NG, .]
- Output - [IH1 (140hz), T (142hz), . (Not voiced), …]
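Here’s a similarly toy sketch of the F0 stage’s interface. The frequency table is just the example output above turned into a lookup; in Deep Voice, a trained network decides for each phoneme whether it is voiced and, if so, at what fundamental frequency.

```python
# Toy sketch of the F0-prediction interface. Values are copied from the example
# above purely for illustration; `None` marks an unvoiced phoneme.
F0_HZ = {"IH1": 140.0, "T": 142.0, ".": None}

def predict_f0(phonemes):
    """Attach a fundamental frequency in Hz (or None if unvoiced) to each phoneme."""
    return [(p, F0_HZ.get(p)) for p in phonemes]

print(predict_f0(["IH1", "T", "."]))
# [('IH1', 140.0), ('T', 142.0), ('.', None)]
```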
Step 3: Audio Synthesis
The final step in creating speech is combining the phonemes, the durations, and the frequencies to output sound. Deep Voice achieves this step using a modified version of DeepMind’s WaveNet. I highly encourage you to read their original blog post to get a sense of WaveNet’s underlying architecture.
At a high level, WaveNet generates raw waveforms, allowing it to create all types of sound, including different accents, emotions, breaths, and other basic parts of human speech. WaveNet can even take this one step further and generate music.
In this paper, the Baidu team modifies WaveNet by optimizing its implementation, particularly for high-frequency inputs. As such, where WaveNet required minutes to generate a second of new audio, Baidu’s modified WaveNet needs as little as a fraction of a second, as described by the authors of Deep Voice here:
Deep Voice can synthesize audio in fractions of a second, and offers a tunable trade-off between synthesis speed and audio quality. In contrast, previous results with WaveNet require several minutes of runtime to synthesize one second of audio.
Our Sentence
Here are the inputs and outputs at this final step of Deep Voice’s pipeline!
- Input - [IH1 (140hz, 0.5s), T (142hz, 0.1s), . (Not voiced, 0.2s), W (140hz, 0.3s),…]
- Output - see below.
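To give a feel for how these annotations map onto a waveform, here’s a deliberately crude stand-in for this step: it renders each voiced phoneme as a sine wave at its predicted F0 for its predicted duration, and unvoiced phonemes as silence. This is only a toy illustration of the data flow; Deep Voice instead feeds the phonemes, durations, and frequencies as conditioning inputs to its modified WaveNet, which predicts the output waveform sample by sample.

```python
import numpy as np

SAMPLE_RATE = 16000  # audio samples per second

def toy_synthesize(annotated_phonemes):
    """Render (phoneme, f0_hz, duration_s) triples as a crude waveform.

    A toy stand-in only: each voiced phoneme becomes a sine wave at its F0,
    and unvoiced phonemes become silence. Deep Voice uses a WaveNet-style
    neural vocoder here instead.
    """
    pieces = []
    for _phoneme, f0_hz, duration_s in annotated_phonemes:
        t = np.arange(int(duration_s * SAMPLE_RATE)) / SAMPLE_RATE
        if f0_hz is None:
            pieces.append(np.zeros_like(t))                     # unvoiced -> silence
        else:
            pieces.append(0.5 * np.sin(2 * np.pi * f0_hz * t))  # voiced -> sine at F0
    return np.concatenate(pieces)

audio = toy_synthesize([("IH1", 140.0, 0.5), ("T", 142.0, 0.1), (".", None, 0.2)])
print(len(audio) / SAMPLE_RATE)  # 0.8 seconds of audio
```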
Summary
And that’s it! With these 3 steps, we’ve seen how Deep Voice takes in a simple piece of text and produces its audio representation. Here’s a summary of the steps once more:
1. Convert text into phonemes.
- “It was early spring” -> [IH1, T, ., W, AA1, Z, ., ER1, L, IY0, ., S, P, R, IH1, NG, .]
2. Predict the durations and frequencies of each phoneme.
- [IH1, T, ., W, AA1, Z, ., ER1, L, IY0, ., S, P, R, IH1, NG, .] -> [IH1 (140hz, 0.5s), T (142hz, 0.1s), . (Not voiced, 0.2s), W (140hz, 0.3s),…]
3. Combine the phonemes, the durations, and the frequencies to output a sound wave that represents the text.
- [IH1 (140hz, 0.5s), T (142hz, 0.1s), . (Not voiced, 0.2s), W (140hz, 0.3s),…] -> Audio
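If you paste the sketches from the earlier steps into a single file, the toy pipeline chains together end to end (phonemes outside the tiny illustrative lookup tables simply fall back to defaults):

```python
# End-to-end toy run, assuming graphemes_to_phonemes, predict_durations,
# predict_f0, and toy_synthesize from the sketches above are in scope.
phonemes = graphemes_to_phonemes("It was early spring.")
durations = dict(predict_durations(phonemes))  # phoneme -> seconds
f0s = dict(predict_f0(phonemes))               # phoneme -> Hz, or None if unvoiced
annotated = [(p, f0s[p], durations[p]) for p in phonemes]
audio = toy_synthesize(annotated)              # a NumPy array of raw samples
```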
But how do we actually train Deep Voice to be able to carry out the above steps? How does Deep Voice leverage Deep Learning to achieve its goals?
In the next blog post, we’ll cover how each piece of Deep Voice is trained and provide more intuition behind the underlying neural networks. This will be released in a few days!