The Inference Pipeline — Converting New Text to Speech
Let’s now see at a high level how Deep Voice takes a simple sentence and converts it into audio that we can hear.
The pipeline, as we’ll see, has the following architecture:
Let’s now go through this pipeline step by step to get an understanding of what these pieces are and how they fit together. In particular, we’ll trace the following phrase and see how it is processed by Deep Voice:
It was early spring.
Step 1: Convert Graphemes (Text) to Phonemes
Languages such as English are peculiar in that they aren’t phonetic. For instance, take the following words (adapted from here) that all use the suffix “ough”:
1. though (like o in go)
2. through (like oo in too)
3. cough (like off in offer)
4. rough (like uff in suffer)
Notice how they all have fairly different pronunciations even though they have the same spelling. If our TTS system used spelling as its main input, it would inevitably run into problems trying to reconcile why “though” and “rough” should be pronounced so differently, even though they have the same suffix. As such, we need a slightly different representation of words, one that reveals more information about their pronunciation.
This is exactly what phonemes are. Phonemes are the different units of sound that we make. Combining them together, we can recreate the pronunciation for almost any word. Here are a few examples of words broken into phonemes (adapted from CMU’s phoneme dictionary):
- White Room — [W, AY1, T, ., R, UW1, M, .]
- Crossroads — [K, R, AO1, S, R, OW2, D, Z, .]
The numbers next to the vowel phonemes mark stress: 1 is primary stress, 2 is secondary stress, and 0 means the vowel is unstressed. Additionally, periods represent empty space (silence) in the pronunciation.
So, the first step in Deep Voice will be to convert every sentence into its phoneme representation using a phoneme dictionary like this one.
Our Sentence
So, for our first step, Deep Voice will have the following inputs and outputs.
- Input - “It was early spring”
- Output - [IH1, T, ., W, AA1, Z, ., ER1, L, IY0, ., S, P, R, IH1, NG, .]
We’ll cover how we train such a model in the next blog post.
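To make this step concrete, here’s a minimal, self-contained sketch of the lookup in Python. The mini-dictionary and the `graphemes_to_phonemes` helper are illustrative stand-ins invented for this example; a real system would use the full CMU pronunciation dictionary (and a learned grapheme-to-phoneme model for out-of-vocabulary words).

```python
# Minimal sketch of Step 1: grapheme-to-phoneme conversion via dictionary lookup.
# CMU_MINI is a hand-written stand-in for the full CMU pronunciation dictionary.
CMU_MINI = {
    "it":     ["IH1", "T"],
    "was":    ["W", "AA1", "Z"],
    "early":  ["ER1", "L", "IY0"],
    "spring": ["S", "P", "R", "IH1", "NG"],
}

def graphemes_to_phonemes(text):
    """Convert a sentence into a flat phoneme sequence, with '.' between words."""
    phonemes = []
    for word in text.lower().strip(".").split():
        phonemes.extend(CMU_MINI[word])  # a KeyError here means we'd need a learned G2P model
        phonemes.append(".")             # '.' marks the empty space between words
    return phonemes

print(graphemes_to_phonemes("It was early spring."))
# ['IH1', 'T', '.', 'W', 'AA1', 'Z', '.', 'ER1', 'L', 'IY0', '.', 'S', 'P', 'R', 'IH1', 'NG', '.']
```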
Step 2, Part 1: Duration Prediction
Now that we have the phonemes, we need to estimate how long each of these phonemes should be held while speaking. This is again an interesting problem, as the same phoneme should be held for a longer or shorter duration depending on its context. Take the following examples surrounding the phonemes “AH N”:
Clearly, “AH N” needs to be held out far longer in the first case than in the second, and we can train a system to do just that. In particular, we’ll take each phoneme and predict how long we’ll hold it (in seconds).
Our Sentence
Here’s what will happen to our example sentence at this step:
- Input - [IH1, T, ., W, AA1, Z, ., ER1, L, IY0, ., S, P, R, IH1, NG, .]
- Output - [IH1 (0.5s), T (0.1s), . (0.2s), … ]
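As a rough sketch of this stage’s interface, here’s a toy Python version. The per-phoneme durations are placeholders chosen to match the example output above, and they ignore the surrounding context entirely; in Deep Voice, a trained network predicts each duration from the phoneme and its neighbours.

```python
# Toy sketch of the duration-prediction interface. The numbers are placeholders;
# a real model predicts each duration from the phoneme's context.
AVG_DURATION_S = {"IH1": 0.5, "T": 0.1, ".": 0.2}  # hypothetical per-phoneme averages

def predict_durations(phonemes):
    """Attach a duration (in seconds) to each phoneme."""
    return [(p, AVG_DURATION_S.get(p, 0.08)) for p in phonemes]  # 0.08s fallback

print(predict_durations(["IH1", "T", "."]))
# [('IH1', 0.5), ('T', 0.1), ('.', 0.2)]
```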
Step 2, Part 2: Fundamental Frequency Prediction
We’ll also want to predict the tone and intonation of each phoneme to make it sound as human as possible. This is especially important in languages like Mandarin, where the same sound can have an entirely different meaning based on its tone and accent. Predicting the fundamental frequency of each phoneme helps us do just this. The frequency tells the system the approximate pitch or tone at which the phoneme should be pronounced.
Additionally, some phonemes aren’t meant to be voiced at all. This means that they are pronounced without any vibrations of vocal cords.
As an example, say the sounds “ssss” and “zzzz” and notice how the former causes no vibrations in your vocal cords (it is unvoiced) while the latter does (it is voiced).
Our fundamental frequency prediction will also take this into account and predict when a phoneme should be voiced and when it should not.
Our Sentence
Here’s what will happen to our example sentence at this step:
- Input - [IH1, T, ., W, AA1, Z, ., ER1, L, IY0, ., S, P, R, IH1, NG, .]
- Output - [IH1 (140hz), T (142hz), . (Not voiced), …]
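Here’s a similarly toy sketch of the F0 stage’s interface. The frequency table is just the example output above turned into a lookup; in Deep Voice, a trained network decides for each phoneme whether it is voiced and, if so, at what fundamental frequency.

```python
# Toy sketch of the F0-prediction interface. Values are copied from the example
# above purely for illustration; `None` marks an unvoiced phoneme.
F0_HZ = {"IH1": 140.0, "T": 142.0, ".": None}

def predict_f0(phonemes):
    """Attach a fundamental frequency in Hz (or None if unvoiced) to each phoneme."""
    return [(p, F0_HZ.get(p)) for p in phonemes]

print(predict_f0(["IH1", "T", "."]))
# [('IH1', 140.0), ('T', 142.0), ('.', None)]
```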
Step 3: Audio Synthesis
The final step in creating speech is combining the phonemes, the durations, and the frequencies to output sound. Deep Voice achieves this step using a modified version of DeepMind’s WaveNet. I highly encourage you to read their original blog post to get a sense of WaveNet’s underlying architecture.
At a high level, WaveNet generates raw waveforms, allowing it to create all types of sound, including different accents, emotions, breaths, and other basic parts of human speech. WaveNet can even take this one step further and generate music.
In this paper, the Baidu team modifies WaveNet by optimizing its implementation, particularly for high-frequency inputs. As such, where WaveNet required minutes to generate a second of new audio, Baidu’s modified WaveNet needs as little as a fraction of a second, as described by the authors of Deep Voice here:
Deep Voice can synthesize audio in fractions of a second, and offers a tunable trade-off between synthesis speed and audio quality. In contrast, previous results with WaveNet require several minutes of runtime to synthesize one second of audio.
Our Sentence
Here are the inputs and outputs at this final step of Deep Voice’s pipeline!
- Input - [IH1 (140hz, 0.5s), T (142hz, 0.1s), . (Not voiced, 0.2s), W (140hz, 0.3s),…]
- Output - see below.
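To give a feel for how these annotations map onto a waveform, here’s a deliberately crude stand-in for this step: it renders each voiced phoneme as a sine wave at its predicted F0 for its predicted duration, and unvoiced phonemes as silence. This is only a toy illustration of the data flow; Deep Voice instead feeds the phonemes, durations, and frequencies as conditioning inputs to its modified WaveNet, which predicts the output waveform sample by sample.

```python
import numpy as np

SAMPLE_RATE = 16000  # audio samples per second

def toy_synthesize(annotated_phonemes):
    """Render (phoneme, f0_hz, duration_s) triples as a crude waveform.

    A toy stand-in only: each voiced phoneme becomes a sine wave at its F0,
    and unvoiced phonemes become silence. Deep Voice uses a WaveNet-style
    neural vocoder here instead.
    """
    pieces = []
    for _phoneme, f0_hz, duration_s in annotated_phonemes:
        t = np.arange(int(duration_s * SAMPLE_RATE)) / SAMPLE_RATE
        if f0_hz is None:
            pieces.append(np.zeros_like(t))                     # unvoiced -> silence
        else:
            pieces.append(0.5 * np.sin(2 * np.pi * f0_hz * t))  # voiced -> sine at F0
    return np.concatenate(pieces)

audio = toy_synthesize([("IH1", 140.0, 0.5), ("T", 142.0, 0.1), (".", None, 0.2)])
print(len(audio) / SAMPLE_RATE)  # 0.8 seconds of audio
```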
Summary
And that’s it! With these 3 steps, we’ve seen how Deep Voice takes in a simple piece of text and produces its audio representation. Here’s a summary of the steps once more:
1. Convert text into phonemes.
- “It was early spring” -> [IH1, T, ., W, AA1, Z, ., ER1, L, IY0, ., S, P, R, IH1, NG, .]
2. Predict the durations and frequencies of each phoneme.
- [IH1, T, ., W, AA1, Z, ., ER1, L, IY0, ., S, P, R, IH1, NG, .] -> [IH1 (140hz, 0.5s), T (142hz, 0.1s), . (Not voiced, 0.2s), W (140hz, 0.3s),…]
3. Combine the phonemes, the durations, and the frequencies to output a sound wave that represents the text.
- [IH1 (140hz, 0.5s), T (142hz, 0.1s), . (Not voiced, 0.2s), W (140hz, 0.3s),…] -> Audio
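If you paste the sketches from the earlier steps into a single file, the toy pipeline chains together end to end (phonemes outside the tiny illustrative lookup tables simply fall back to defaults):

```python
# End-to-end toy run, assuming graphemes_to_phonemes, predict_durations,
# predict_f0, and toy_synthesize from the sketches above are in scope.
phonemes = graphemes_to_phonemes("It was early spring.")
durations = dict(predict_durations(phonemes))  # phoneme -> seconds
f0s = dict(predict_f0(phonemes))               # phoneme -> Hz, or None if unvoiced
annotated = [(p, f0s[p], durations[p]) for p in phonemes]
audio = toy_synthesize(annotated)              # a NumPy array of raw samples
```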
But how do we actually train Deep Voice to be able to carry out the above steps? How does Deep Voice leverage Deep Learning to achieve its goals?
In the next blog post, we’ll cover how each piece of Deep Voice is trained and provide more intuition behind the underlying neural networks. This will be released in a few days!