What is Text-to-Speech? | TTS Systems

Text-to-speech (TTS) synthesis systems convert written text into speech for use in voice assistant technologies. At PTW/SIDE, we collect and curate data for use in the TTS systems of our big tech clients.

To build a TTS synthesis model, our clients need audio, matching text, and some additional information about the text, such as how the individual words should be pronounced. Together, these data points are the input used to train a model that can automatically output speech and correctly read text aloud. TTS is used in smartphones and smart speakers, assistive technologies for people with vision impairment, train stations and airports, and a growing number of other applications.

The first stage in a TTS data project is to recruit qualified linguists. The major prerequisite for candidates is that they have studied formal linguistics, particularly phonetics and phonology. They must have experience (even if just from a university project) in transcribing the pronunciation of words using the International Phonetic Alphabet (IPA) or a similar system. The linguists must also be native speakers of the language for which we are recruiting.

Potential candidates must pass a screening test (designed by the client), consisting largely of pronunciation transcriptions, and an interview.

Many languages vary widely across different regions. For the purposes of TTS systems, however, it is necessary to define a particular variety as specifically as possible. The questions linguists should consider to select a target dialect for the data collection include:

Which dialect is used in broadcast news?
Which dialect is spoken in the capital city or main urban center?
Do speakers code-switch (mix two or more languages while speaking)?
Is there a difference between the spoken and written varieties of the language?

A TTS system also requires a linguistic definition of the language in question. Our linguists help to define the set of phonemes (distinct sound units) permitted in the language, as well as which combinations of sounds are valid (phonotactics).
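As a rough illustration of what such a definition can look like in practice, here is a minimal sketch in Python. The phoneme inventory, the vowel set, and the single phonotactic rule (a cap on consecutive consonants) are all hypothetical toy values chosen for the example; they are not taken from any real language definition or client tool.

```python
# Hypothetical toy inventory; a real language definition is much larger.
PHONEMES = {"p", "t", "k", "s", "a", "i", "u"}
VOWELS = {"a", "i", "u"}
MAX_CLUSTER = 2  # assumed rule: at most two consonants in a row


def is_valid(phones):
    """Check a phoneme sequence against the inventory and the cluster rule."""
    # Every symbol must belong to the defined phoneme set.
    if any(p not in PHONEMES for p in phones):
        return False
    # Count consecutive consonants; a vowel resets the count.
    run = 0
    for p in phones:
        if p in VOWELS:
            run = 0
        else:
            run += 1
            if run > MAX_CLUSTER:
                return False
    return True
```

Under these assumed rules, `is_valid(["s", "t", "a"])` is accepted, while a sequence with three consonants in a row, or one containing a symbol outside the inventory, is rejected.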

The system uses this information, and the training data (audio and matching text), to learn how to convert letters into phonemes. Our goal is a model that can automatically pronounce any word it comes across. Unfortunately, it isn’t always that simple. Linguists often have to transcribe the pronunciation of words explicitly, particularly for less common words, foreign loanwords, or words whose spelling does not follow the usual rules of pronunciation. All these spellings and pronunciations are put into a special dictionary, called a pronunciation lexicon.
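The relationship between the lexicon and the learned letter-to-phoneme conversion can be sketched as a lookup with a fallback. This is only an illustration: the lexicon entry and the one-letter-per-phoneme table below are hypothetical stand-ins, and in a real system the fallback would be a trained model rather than a fixed table.

```python
# Hypothetical lexicon entry for a word whose spelling is irregular.
LEXICON = {"colonel": ["k", "ɜː", "n", "ə", "l"]}

# Toy stand-in for the learned letter-to-phoneme conversion (an assumption).
LETTER_TO_PHONEME = {"a": "æ", "c": "k", "n": "n", "t": "t"}


def pronounce(word):
    """Prefer the explicit lexicon entry; otherwise fall back to letter rules."""
    key = word.lower()
    if key in LEXICON:
        return list(LEXICON[key])
    # Letters without a rule are skipped in this simplified fallback.
    return [LETTER_TO_PHONEME[ch] for ch in key if ch in LETTER_TO_PHONEME]
```

Here `pronounce("cat")` goes through the fallback table, while `pronounce("Colonel")` returns the explicitly transcribed entry, which is the case the lexicon exists to handle.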

The next stage in a TTS data project is to prepare a script that will be read aloud and recorded by a voice artist. The goal of this script is to capture all the phoneme combinations permitted in the language. We also need to ensure that the script is easily readable. Finally, every word in the script has its pronunciation checked and transcribed in the lexicon, to ensure that the model receives high-quality training data.
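One common way to approach the coverage goal is a greedy selection: repeatedly pick the candidate sentence that adds the most phoneme pairs not yet covered. The sketch below assumes each candidate sentence already has a phoneme transcription and that the target is a set of adjacent phoneme pairs; the function names and data shapes are hypothetical, and this is not necessarily how any particular script is built.

```python
def phoneme_pairs(phones):
    """Return the set of adjacent phoneme pairs in a transcription."""
    return set(zip(phones, phones[1:]))


def pick_script(candidates, target):
    """Greedily choose sentences until the target pairs are covered.

    candidates: dict mapping sentence text to its list of phonemes
    target: set of (phoneme, phoneme) pairs the script should contain
    Returns the chosen sentences and any pairs left uncovered.
    """
    remaining = set(target)
    chosen = []
    while remaining and candidates:
        # Pick the sentence that covers the most still-missing pairs.
        best = max(candidates,
                   key=lambda s: len(phoneme_pairs(candidates[s]) & remaining))
        gain = phoneme_pairs(candidates[best]) & remaining
        if not gain:
            break  # no candidate adds anything new
        chosen.append(best)
        remaining -= gain
    return chosen, remaining
```

Any pairs returned as uncovered indicate that new sentences need to be written or sourced. Readability, which the script also needs, is not modeled here and would still be checked by a linguist.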

The penultimate stage in a TTS data project is to record audio. For this we need to find a voice actor who speaks the target dialect natively. They must also be a good sight-reader, as the script lines are not learned in advance.

The selected voice talent then records the script we have curated, and we update pronunciations in the lexicon as necessary. If the voice talent regularly pronounces a word in a certain way, for example, we update the lexicon entry to match that pronunciation as closely as possible.

SIDE specializes in certifying studios and setting up teams specifically for TTS recordings all around the world. We have cast and recorded voices in over 40 languages and locations.

Casting and recording for TTS is very different from other types of voice-over. Each line must be read with a strictly even projection and consistent tone and volume. The reading should also be natural and not overly performative. This is to ensure uniform training data, which generally leads to smoother synthesized speech from the TTS voice.

Our linguists must then evaluate the recordings to ensure that the audio matches the text, and that the voice talent’s pronunciations are valid and correspond to those stored in our lexicon. This is the stage at which we can update the lexicon entries if needed.
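Part of this comparison can be sketched in code. The example below assumes the linguists' observations are available as (word, pronunciation) pairs and that the lexicon maps each word to a set of accepted pronunciations; both data shapes, the function name, and the `min_count` threshold are hypothetical. It flags words that the voice talent repeatedly pronounces in a way the lexicon does not list, which are candidates for a lexicon update.

```python
from collections import Counter


def flag_mismatches(lexicon, observed, min_count=2):
    """Find words whose most frequent observed pronunciation is not in the lexicon.

    lexicon: dict mapping word to a set of pronunciations (tuples of phonemes)
    observed: iterable of (word, pronunciation tuple) pairs from the recordings
    min_count: how often a pronunciation must occur before it is flagged
    """
    counts = {}
    for word, pron in observed:
        counts.setdefault(word, Counter())[pron] += 1

    flagged = {}
    for word, counter in counts.items():
        pron, n = counter.most_common(1)[0]
        if n >= min_count and pron not in lexicon.get(word, set()):
            flagged[word] = pron
    return flagged
```

The threshold is there so that a single slip in one take does not trigger a lexicon change; only a pronunciation the talent uses regularly is surfaced for review.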

Because everything is done with client tools, there is no handoff of data at the end of the evaluation stage. Once we have evaluated the audio quality and accuracy, SIDE steps out of the process and the client team is ready to use their new audio files. TTS is becoming a larger part of how tech companies incorporate accessibility features into their work, and SIDE is happy to be a part of it. Crafting a strategy for incorporating TTS into your work can be tricky, so we offer a free consultation on what working with SIDE on TTS could look like for your business.
