Tech company DeepMind is using their artificial intelligence and machine learning chops to improve the quality of computer generated speech, with implications for customization and dynamism of IVR and other automated systems.
When a customer calls in to your auto attendant, you want them to feel comfortable and, above all, to understand the information being relayed to them. By far the most effective way to achieve this is to have a human recording of every phrase and response the system might need. People have been listening to human speech since birth, so it feels natural, especially when different languages and dialects are involved.
There are several problems with this approach, however. You may not have the data-storage capacity for all of the necessary recordings, or the time and expense of hiring someone to speak every prompt may be too great. To complicate matters further, most modern IVR systems present dynamically generated data that may change on a regular basis, or even be refreshed with each and every call, making it impossible to predict which recordings you would need on hand. In these situations, another solution is required: generating and playing back speech programmatically.
There are a few ways this can be accomplished. The first is to have recordings of individual words or short phrases that can be strung together as needed. This is most useful when there are varying but predictable patterns, such as dollar amounts, website URLs, or calendar dates. The drawback is that new prompts cannot be added on the fly, and if the original voice talent is unavailable for a future addition, you may need to re-record everything or live with the disjointed experience of hearing multiple voices in different sections.
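To make the idea concrete, here is a minimal sketch of word-level prompt concatenation for dollar amounts. The clip file names (e.g. "forty.wav") are hypothetical; a real IVR would map them to actual recordings made by a single voice talent.

```python
# Word-level concatenation: spell out a dollar amount as an ordered
# list of pre-recorded clips. File names are hypothetical placeholders.

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def number_to_prompts(n: int) -> list:
    """Return the ordered clip names for a number from 0 to 99."""
    if n < 20:
        return [ONES[n] + ".wav"]
    tens, ones = divmod(n, 10)
    clips = [TENS[tens] + ".wav"]
    if ones:
        clips.append(ONES[ones] + ".wav")
    return clips

def amount_to_prompts(dollars: int, cents: int) -> list:
    """String together clips for '<dollars> dollars and <cents> cents'."""
    return (number_to_prompts(dollars) + ["dollars.wav", "and.wav"]
            + number_to_prompts(cents) + ["cents.wav"])
```

Calling `amount_to_prompts(42, 7)` yields the playback order forty.wav, two.wav, dollars.wav, and.wav, seven.wav, cents.wav; the pattern is predictable, so the clip inventory stays small.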
The second method is to record tiny bits of words, called speech fragments, that can be combined in numerous ways to form new and different words. Sounding out a word like “DiRAD” as several component sounds, “duh-eye-errr-aaa-duh,” is what powers the speech of assistants like Apple’s Siri and Microsoft’s Cortana. This method results in fairly coherent speech, but setting it all up is beyond the reach of most businesses, and it lacks flexibility if a need arises to adjust the voice in any way, since it is based on the real voice of one person.
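The fragment approach can be sketched as a lookup from a word to its component sound units, each backed by a recorded clip. The pronunciation table and unit names below are invented for illustration; real systems cut far larger unit inventories from one speaker's recordings.

```python
# Toy fragment-based synthesis: look a word up in a (hypothetical)
# pronunciation table, then return the clip recorded for each sound unit.

PRONUNCIATIONS = {
    "dirad": ["d", "ay", "r", "ae", "d"],   # roughly "duh-eye-errr-aaa-duh"
    "dial":  ["d", "ay", "l"],
}

def word_to_clips(word: str) -> list:
    """Return the fragment clip names to play, in order, for one word."""
    return [unit + ".wav" for unit in PRONUNCIATIONS[word.lower()]]
```

A handful of fragments can cover many words, which is the method's appeal; the cost is recording and labeling every unit consistently.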
The third method is similar but entirely electronic. The component sounds are generated from rules about the ways letters fit together. While this is very flexible and dynamic, it results in the most robotic-sounding speech, along with occasional flubs on words that don’t quite follow pronunciation conventions (“queue” or “colonel,” for instance). None of these options, then, is completely ideal.
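A deliberately naive rule set shows where letter-to-sound rules break down: mapping letters one at a time handles regular words but mangles irregular ones. The rules here are toy values invented for illustration, not a real synthesizer's.

```python
# Naive letter-to-sound mapping, one letter at a time. Fine for regular
# spellings, hopeless for irregular ones like "queue". (Toy rules only.)

LETTER_SOUNDS = {"q": "kw", "u": "uh", "e": "eh", "b": "b", "d": "d"}

def naive_sounds(word: str) -> str:
    return "-".join(LETTER_SOUNDS[c] for c in word.lower())

# naive_sounds("queue") gives "kw-uh-eh-uh-eh", nothing like "kyoo"
```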
DeepMind, purchased by Google in 2014, has a new solution to this quandary that is already pushing the limits of what computer-generated speech is capable of. It is called WaveNet, and it is something of a combination of all the methods described above. The system starts with recordings of real human speech and samples the sound waves more than 16,000 times per second to build a model of what those waveforms look like. It then reproduces the same waveforms electronically, resulting in sounds very similar to the originals to the human ear.
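To see what "sampling more than 16,000 times per second" means in practice: one second of audio becomes a list of 16,000 amplitude values. This sketch generates a pure 440 Hz tone at that rate; WaveNet models real speech waveforms at the same sample-by-sample level.

```python
# Sampling a waveform: one second at 16 kHz is 16,000 numbers, each the
# wave's amplitude at one instant. Here the wave is a simple sine tone.
import math

SAMPLE_RATE = 16000  # samples per second

def sample_tone(freq_hz: float, seconds: float) -> list:
    n = int(SAMPLE_RATE * seconds)
    return [math.sin(2 * math.pi * freq_hz * t / SAMPLE_RATE)
            for t in range(n)]

samples = sample_tone(440.0, 1.0)  # one second -> 16,000 samples
```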
This is where DeepMind’s expertise in artificial intelligence and machine learning comes into play. The more samples that are fed into the system, the better WaveNet is able to predict what a given word should sound like, leading to a more accurate electronic reproduction. By modeling the sounds on real human speech, they achieve a very realistic and understandable voice that can be modulated or manipulated in any way they see fit, without reliance on any particular human speaker, since the result is an average of all the samples that have undergone this analysis.
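The prediction loop at the heart of this can be shown in miniature: each new audio sample is generated from the samples before it. WaveNet does this with a deep neural network conditioned on thousands of past samples; this toy stands in a fixed two-sample linear extrapolation just to show the autoregressive shape of the process.

```python
# Autoregressive generation in miniature: predict the next sample from
# the previous ones, append it, repeat. (Toy predictor, not WaveNet's.)

def generate(seed, steps):
    out = list(seed)
    for _ in range(steps):
        nxt = 2 * out[-1] - out[-2]            # predict from the last two
        out.append(max(-1.0, min(1.0, nxt)))   # clamp to valid audio range
    return out
```

Swapping the one-line predictor for a learned model trained on many speakers' waveforms is, loosely, what turns this loop into realistic speech.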
In blind tests conducted to see whether listeners preferred WaveNet speech to other methods, WaveNet scored 50% higher than the current leading technologies, although it still lost by a slim margin to actual human speech recordings. But in an environment where actual recordings are not practical for one of the reasons discussed above, this will absolutely be the next best thing. In the long run, it will lead to more helpful and coherent automated systems, with near-limitless potential for customization and dynamically generated content. We’ll be watching these developments closely, with an eye toward providing better IVR systems for all.