Text-To-Speech (TTS) Services

Behind the Scenes: Text-to-Speech

In addition to our game localization services, our localization and audio teams also provide text-to-speech services. We all have some experience with text-to-speech (TTS). You may have had online text spoken aloud or a maps application read you turn-by-turn directions. Regardless of its particular use, the technology has advanced considerably since its inception in the late 1970s. Gone are the days of eerie, mechanical simulacra of actual human voices. These days, what emanates from speakers is much warmer and more comprehensible. But how is this accomplished?

Shedding light on the TTS process is Micaela Bester, Linguistic Program Manager, Localization. Micaela manages all linguistic TTS work for our biggest client. She studied linguistics at university, at a time when there were no jobs quite like hers.

“My choices were just going into academia or speech pathology,” she says. “I ended up teaching English in Italy.” Then, unexpectedly, an opportunity fell into her lap: working for Google as a Data Annotator. After four years, following a stint in Dublin, Ireland, she left in search of a change. “Fortunately, I discovered PTW in December 2019 and started as a Program Manager, working with some of the biggest tech companies in the world.”

Locating talent

The process begins with a request. “The client contacts us and says we want to build an assistant voice in Hindi, for example,” explains Micaela. “We start with recruitment. Ask HR to post in all the usual channels.” Most applicants come from specialist job boards and resources where linguistics roles are advertised, usually academic in nature. “Our teams normally look for translators, but for these projects we don’t want that; we want phonologists, someone who studies the speech sounds in a given language. It’s niche and difficult to find people.”

The success rate for finding qualified respondents is low, so the process can take some time. “We get a lot of applications from translators who think they can do the work, but it’s a very specific skillset.” People with the right competencies are often already employed full-time, so they’re not looking for more work. This job is part-time and project-based, so it is not steady work. “It’s good for people in academia who sometimes have extra hours between research, etc.”

Does this affect project timelines? “Yes, having already produced most of the major languages, we’re moving further afield into developing countries, which makes it harder to find phonologists. We used to be able to finish a typical project in four months; now it takes longer.”

During the selection process, successful respondents are asked to tell us a little about their own language, such as features of the standard dialect. “For example, for the Sydney, Australian English dialect, you listen for words like ‘chance’, which is pronounced differently in different regions.”

They also get questions about the language itself. For instance, which consonant clusters are ‘illegal’? Is a word allowed to start with the letters ‘str’, for example? This cluster is legal in English, but not necessarily in other languages.
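To illustrate the kind of question above, here is a minimal Python sketch of an onset check. It assumes a hypothetical, hand-picked sample of legal word-initial clusters for English, and it works on spelling rather than on actual speech sounds, so it is a simplification of what a phonologist would do. The names LEGAL_ONSETS, initial_cluster, and has_legal_onset are invented for this example.

```python
# Hypothetical sample of legal word-initial consonant clusters in English.
# This is an illustrative subset, not a complete inventory.
LEGAL_ONSETS = {"str", "spr", "spl", "st", "sp", "tr", "pl", "bl", "gr"}
VOWELS = set("aeiou")


def initial_cluster(word: str) -> str:
    """Return the run of consonant letters at the start of a word."""
    cluster = []
    for ch in word.lower():
        if ch in VOWELS:
            break
        cluster.append(ch)
    return "".join(cluster)


def has_legal_onset(word: str) -> bool:
    """True if the word starts with a vowel, a single consonant,
    or one of the clusters listed in LEGAL_ONSETS."""
    cluster = initial_cluster(word)
    return len(cluster) <= 1 or cluster in LEGAL_ONSETS


print(has_legal_onset("street"))  # 'str' is in the sample list
print(has_legal_onset("tlon"))    # 'tl' is not in the sample list
```

A different language would simply get a different set of clusters; the check itself stays the same.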

Writing the script

The script has often already been generated by the client, but linguists must proofread it, not just to make sure it’s grammatically correct but also to annotate it for the pronunciation of things like addresses or proper names.

Numbers are a good example of differing pronunciation requirements: should ‘1990’ be pronounced ‘nineteen ninety’ or ‘one thousand, nine hundred ninety’? The client usually produces the script by harvesting the web. It’s composed of random, mostly gibberish sentences that the linguists must rewrite for readability.
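The ‘1990’ question can be made concrete with a short Python sketch. It assumes the annotation carries a hypothetical reading label ("year" or "cardinal") and handles only numbers from 0 to 9999 in English; the function names are invented for this example and do not describe the client's actual tooling.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def two_digits(n: int) -> str:
    """Spell out a number from 0 to 99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def as_year(n: int) -> str:
    """Read a four-digit number as two pairs, e.g. 19|90."""
    high, low = divmod(n, 100)
    if low == 0:
        return f"{two_digits(high)} hundred"
    if low < 10:
        return f"{two_digits(high)} oh {ONES[low]}"
    return f"{two_digits(high)} {two_digits(low)}"


def as_cardinal(n: int) -> str:
    """Read a number from 0 to 9999 as a quantity."""
    thousands, rest = divmod(n, 1000)
    hundreds, low = divmod(rest, 100)
    parts = []
    if thousands:
        parts.append(f"{ONES[thousands]} thousand")
    if hundreds:
        parts.append(f"{ONES[hundreds]} hundred")
    if low or not parts:
        parts.append(two_digits(low))
    return " ".join(parts)


def verbalize(n: int, reading: str) -> str:
    """Pick the reading based on the (hypothetical) annotation label."""
    return as_year(n) if reading == "year" else as_cardinal(n)


print(verbalize(1990, "year"))      # nineteen ninety
print(verbalize(1990, "cardinal"))  # one thousand nine hundred ninety
```

The point of the annotation is that the digits alone cannot tell the system which reading is wanted; the label has to come from a person who understands the sentence.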

Linguists must transcribe how each sentence is pronounced, using a phonetic alphabet. They’re given a list: a mapping between sounds and the symbols they should use to represent those sounds. However, PTW does not use the classic International Phonetic Alphabet (IPA), a phonetic notation widely used to guide pronunciation.

“It’s not super useful, not keyboard-friendly. We use SAMPA (Speech Assessment Methods Phonetic Alphabet). It helps if the Voice Director and Project Manager are familiar with it,” explains Micaela. SAMPA was developed in the late 1980s to be computer-readable.
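As a rough illustration of why SAMPA is easier to type, the Python sketch below converts a few IPA symbols to their SAMPA counterparts. The table is a small subset of the published SAMPA correspondences, not the symbol list given to PTW's linguists, and the names IPA_TO_SAMPA and ipa_to_sampa are invented for this example.

```python
# A small subset of IPA-to-SAMPA correspondences.
# Symbols that are plain ASCII letters in IPA (p, t, k, s, ...) are the
# same in SAMPA, so they are passed through unchanged.
IPA_TO_SAMPA = {
    "ʃ": "S",   # as in "ship"
    "ʒ": "Z",   # as in "measure"
    "θ": "T",   # as in "thin"
    "ð": "D",   # as in "this"
    "ŋ": "N",   # as in "sing"
    "ə": "@",   # schwa
    "æ": "{",   # as in "cat"
    "ɪ": "I",   # as in "bit"
    "ʊ": "U",   # as in "put"
    "ɛ": "E",   # as in "bed"
    "ʌ": "V",   # as in "cup"
}


def ipa_to_sampa(ipa: str) -> str:
    """Replace each IPA character with its SAMPA equivalent if one is listed."""
    return "".join(IPA_TO_SAMPA.get(ch, ch) for ch in ipa)


print(ipa_to_sampa("ʃɪp"))  # SIp
print(ipa_to_sampa("θɪŋ"))  # TIN
print(ipa_to_sampa("ðæt"))  # D{t
```

Every symbol on the SAMPA side can be typed on a standard keyboard, which is the practical advantage Micaela describes.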

Recording the script

Once the script has been sanitized and annotated, our voice production specialists, SIDE, take over the process. The casting phase begins with a search for voice talent that matches the persona style requested by the client. The Voice Director assesses consistency, endurance, enunciation, prosody, and whether peculiarities of dialect can be replicated.

The speaker must be a native speaker and have the desired accent. They do not necessarily have to be an established voice talent, but the qualities and skills required tend to be found in people who speak professionally.

“And it’s not a simple recording of single words, as you might think,” reveals Micaela. “When people read words in isolation, they tend to pronounce them unnaturally. So, most of the script is composed of sentence-level common phrases.”

Older TTS systems never sounded natural because they worked by literally chopping up words from sentences and pasting them back together, which produced weird vocal effects. Because the script is so long, the voice talent must sight-read very quickly. Usually, 10,000 lines of dialogue are recorded over a few weeks.

The Voice Director makes sure the persona style is maintained throughout production and that the voice quality remains consistent. If the voice talent has a slight cold or sounds a little different, recordings are suspended.

How long is an average daily recording session? “About four hours. This is mainly to preserve the vocal cords.” Files are sent for processing on a rolling basis as each script is recorded; they aren’t all processed at the same time at the end of the recording.

“SIDE uploads to linguists for analysis,” continues Micaela. “Linguists check against the script itself, but also against the pronunciations. The tools we use are pretty good, compared to others.” They also listen for audio issues, but the Post-Production team usually catches them before they get to the linguists.

Processing the data

Once the files are checked, they are delivered to the client for processing by their proprietary algorithms. “It’s a bit of a black box. It’s all Machine Learning. We say goodbye to all the data at that point. The application learns to synthesize the text; it’s a training model. The voice synthesis gets tested and checked internally until it’s satisfactory. If they find errors, or the voice sounds bad, then some re-recordings are made to fix the problems.”

What are the most challenging aspects of this position? “Many, including herding the freelancers all over the world,” laughs Micaela. “Moving toward languages that are very different is the next big hurdle. Not all languages work like English. The systems aren’t set up to deal with tone. No lexicon exists, so we must build one from scratch.”

Text-to-speech is already commonplace in our everyday lives, and more applications will likely arise. While the process described above may seem straightforward, it’s clear that new methodologies must be developed to serve the many languages of the world we haven’t touched yet.

But these challenges, which require active thought and research, are what keep Micaela engaged in the work. She looks forward to continuing to develop her understanding of the world’s languages through her work at PTW.
