What is SSML & How Is It Significant in Voice Applications?
Introduced back in 2004, SSML, or Speech Synthesis Markup Language, is a markup language that helps standardize the way our digital devices speak to us. A markup language is simply a system for annotating a document in a way that is syntactically distinguishable from the text itself. SSML is designed to aid in generating synthetic speech for web-based and voice-based applications, and it helps authors control aspects like volume, pitch, rate, and pronunciation across synthesis-capable platforms such as voice apps. Essentially, it gives the author of a voice application the ability to add speech effects and control how the speech is generated.
This is incredibly neat, and understanding how to implement and use SSML is critical if you want to create better voice bots, voice skills, or voice-enabled applications.
How Does SSML Work & What Are Its Key Concepts?
The intended use of Speech Synthesis Markup Language is to improve the quality of synthesized content. The idea is to make it sound like natural speech and make conversations seem human-like by changing how the text is voiced by the voice application or voice assistant. SSML is designed to do the following:
- Interoperate with other markup languages and web technologies such as VoiceXML, SMIL, XHTML, HTML, CSS, XML, and the DOM, so that text from any of them can be synthesized under standard guidelines.
- Provide consistent, predictable control over voice output across voice-enabled platforms and speech synthesis implementations.
- Support internationalization, meaning it can enable speech output in a large number of languages across many types of documents.
For speech to be generated automatically from plain or annotated text, the synthesizer must be able to render the document using only the information contained in the markup. In other words, the document must render as spoken output without requiring additional input. For this to work, the text goes through a synthesis process, and SSML tags are used to denote how the text should be spoken.
What Are SSML Tags & How Are They Used?
For authors to change how the voice output is generated, tags must be inserted into the text. These tags are written in angle brackets, such as <soft> (not an actual SSML tag), and enclose a specific section of text. The speech synthesizer uses these tags to understand how the text should be spoken. For voice applications built as Alexa Skills, the following SSML tags can be used:
- audio
- break
- say-as
- speak
- sub
- voice
- w
- emphasis
- phoneme
- prosody
- s
- amazon:domain
- amazon:effect
- amazon:emotion
A very simple example of an SSML tag in use is the break (or pause) tag:
Hi Anna! <break time="3s"/> What music would you like today?
In the above example, Alexa addresses Anna, the owner of the Alexa device, pauses for 3 seconds, and then asks the associated question. You can specify the pause in either seconds or milliseconds, but the tag must be written correctly for it to work. You can also have Alexa ask any number of greeting questions; it just depends on which voice app you are working on and what you want it to convey.
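Since the tag only works when it is written correctly, a quick well-formedness check can catch mistakes such as typographic quotes around an attribute value before the markup reaches Alexa. The following is a minimal Python sketch, not an official validator: the function name is hypothetical, and it only checks that the string is well-formed XML with a speak root element, not that Alexa will accept every tag inside it.

```python
from xml.parsers import expat


def is_well_formed_ssml(ssml: str) -> bool:
    """Return True if the string is well-formed XML whose root element is <speak>.

    expat is used without namespace processing so that Alexa's prefixed tags
    (for example amazon:emotion) are treated as plain element names.
    """
    seen = []  # element names in the order they are opened

    def on_start(name, attrs):
        seen.append(name)

    parser = expat.ParserCreate()
    parser.StartElementHandler = on_start
    try:
        parser.Parse(ssml, True)
    except expat.ExpatError:
        return False
    return bool(seen) and seen[0] == "speak"


good = '<speak>Hi Anna! <break time="3s"/> What music would you like today?</speak>'
# Curly quotes are not valid attribute delimiters in XML.
bad = '<speak>Hi Anna! <break time=”3s”/> What music would you like today?</speak>'

print(is_well_formed_ssml(good))  # True
print(is_well_formed_ssml(bad))   # False
```

A check like this is cheap to run in a unit test for every response string a skill can return.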
Combining SSML Tags is More Than Possible
With Alexa Skills, you can combine more than one SSML tag to create different effects. For instance, you can use Amazon’s emotion and say-as tags together to control both the tone in which a string of words is spoken and how those words are interpreted. For example, if you ask Alexa to count down, you would want the numbers spoken as individual digits rather than as a single number. To make this happen, you would use the “say-as” tag, and if you wanted it said in an excited way, you would also use the “amazon:emotion” tag. Here is an example from Amazon’s Developer Page for custom SSML tags.
<speak>
<amazon:emotion name="excited" intensity="medium">
Five seconds till lift off!
<say-as interpret-as="digits">54321</say-as>.
Lift off!
</amazon:emotion>
</speak>
Alexa would say the above lines in an excited voice and read out “54321” digit by digit. You could just as easily use the emotion “disappointed” instead of “excited”, immediately changing the tone of the spoken output.
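When a skill generates this kind of response in code, the nesting can be assembled programmatically. The sketch below is one possible way to do it in Python; the function name and its parameters are hypothetical, and it does not check whether Alexa accepts the emotion or intensity value you pass in.

```python
from xml.sax.saxutils import quoteattr


def countdown_ssml(digits: str, emotion: str = "excited", intensity: str = "medium") -> str:
    """Wrap a countdown in amazon:emotion, with the numbers read as digits."""
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError("digits must contain only the characters 0-9")
    return (
        "<speak>"
        # quoteattr escapes the value and adds the surrounding quotes.
        f"<amazon:emotion name={quoteattr(emotion)} intensity={quoteattr(intensity)}>"
        "Five seconds till lift off! "
        f'<say-as interpret-as="digits">{digits}</say-as>. '
        "Lift off!"
        "</amazon:emotion>"
        "</speak>"
    )


print(countdown_ssml("54321"))
```

The inner say-as tag sits entirely inside the outer amazon:emotion tag, so both effects apply to the digits.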
Are There Incompatible Tags or Tags That Cannot Be Put Together?
Not all tags can be combined or applied to the same section of speech. According to Amazon’s Developer Page, with Amazon Alexa you cannot combine amazon:domain, amazon:emotion, speechcons, emphasis, voice, and prosody with one another. However, the restriction applies only to speechcons, which are implemented with the say-as tag and an interpret-as value of interjection; say-as with any other interpret-as value can still be combined with other tags. So, you could easily combine the emphasis tag with a say-as ordinal command, causing Alexa to say “first”, “second”, or “third” with emphasis (an ordinal being the position of something in a series). It is important to note that the SSML tags supported by Google and by Amazon may differ, and so may the combinations they allow.
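A rule like this can be checked mechanically before a response is sent. The Python sketch below walks the markup and reports any restricted tag nested inside a different restricted tag. The set of restricted tags is copied from the list above as this article states it, and the function names are hypothetical, so treat it as an illustration and confirm the current rules on Amazon's Developer Page.

```python
from xml.parsers import expat

# Restricted tags as listed in this article; verify against Amazon's docs.
RESTRICTED = {"amazon:domain", "amazon:emotion", "emphasis", "voice", "prosody"}


def restricted_label(name, attrs):
    """Return a label if the element is restricted, otherwise None."""
    if name in RESTRICTED:
        return name
    # Speechcons are say-as elements with interpret-as="interjection".
    if name == "say-as" and attrs.get("interpret-as") == "interjection":
        return "speechcon"
    return None


def find_conflicts(ssml):
    """Return (outer, inner) pairs of different restricted tags that are nested."""
    conflicts = []
    stack = []  # one entry per open element: its label, or None

    def on_start(name, attrs):
        label = restricted_label(name, attrs)
        if label is not None:
            for outer in stack:
                if outer is not None and outer != label:
                    conflicts.append((outer, label))
        stack.append(label)

    def on_end(name):
        stack.pop()

    parser = expat.ParserCreate()  # no namespace processing, so amazon: is fine
    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    parser.Parse(ssml, True)
    return conflicts


ok = '<speak><emphasis level="strong"><say-as interpret-as="ordinal">1</say-as></emphasis></speak>'
clash = '<speak><amazon:emotion name="excited" intensity="medium"><prosody rate="slow">Hi</prosody></amazon:emotion></speak>'

print(find_conflicts(ok))     # []
print(find_conflicts(clash))  # [('amazon:emotion', 'prosody')]
```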
What Are Common SSML Tags & How Are They Used?
- The audio tag lets you provide a URL for an MP3 file that Alexa plays while she renders a response. You can use this tag to insert a pre-recorded audio clip or sound effects alongside your response. The audio tag takes the src attribute, which specifies the audio source. It is important to note that the MP3 must be valid, cannot be longer than 240 seconds, and must have a bit rate of 48 kbps. It must also be hosted on an HTTPS endpoint with a trusted SSL certificate.
An Example of the Audio Tag
<speak>
Welcome to Ride Hailer.
<audio src="soundbank://soundlibrary/transportation/amzn_sfx_car_accelerate_01" />
You can order a ride or request a fare estimate.
Which will it be?
</speak>
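For MP3 files you host yourself, the requirements described above can be checked before the URL is placed in an audio tag. This is a rough Python sketch under stated assumptions: the function name and the example URL are hypothetical, the duration and bit rate are values you have already measured with your own tooling, and the check does not download the file or verify its SSL certificate.

```python
from urllib.parse import urlparse

MAX_SECONDS = 240    # maximum clip length stated above
REQUIRED_KBPS = 48   # required bit rate stated above


def hosted_mp3_problems(url, duration_s, bitrate_kbps):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if urlparse(url).scheme != "https":
        problems.append("file must be served from an HTTPS endpoint")
    if duration_s > MAX_SECONDS:
        problems.append(f"clip is {duration_s}s long; the limit is {MAX_SECONDS}s")
    if bitrate_kbps != REQUIRED_KBPS:
        problems.append(f"bit rate is {bitrate_kbps} kbps; it must be {REQUIRED_KBPS} kbps")
    return problems


print(hosted_mp3_problems("https://example.com/welcome.mp3", 12, 48))   # []
print(len(hosted_mp3_problems("http://example.com/long.mp3", 300, 128)))  # 3
```

The soundbank:// source in the example above points at Amazon's own sound library rather than a self-hosted file, so this check does not apply to it.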
- The break tag lets you insert a pause in the speech, set with either the strength or the time attribute. The strength attribute determines how strong the pause is, for example whether adjacent words are separated as if by a single comma or by a sentence or paragraph break. The time attribute sets the duration of the pause, up to 10 seconds, and requires a unit (seconds or milliseconds) to be included with the value.
An Example of the Break Tag
<speak>
There is a three second pause here <break time="3s"/>
then the speech continues.
</speak>
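The two rules for the time attribute (a unit is required, and the pause may last at most 10 seconds) are easy to enforce in a small helper. The following Python sketch uses a hypothetical function name and, as a simplifying assumption, accepts only whole numbers followed by s or ms.

```python
import re

MAX_BREAK_MS = 10_000  # 10 seconds, the limit described above


def break_tag(time: str) -> str:
    """Return a break tag for a duration such as "3s" or "500ms"."""
    match = re.fullmatch(r"(\d+)(ms|s)", time)
    if match is None:
        raise ValueError(f"time must be a whole number followed by s or ms, got {time!r}")
    value, unit = int(match.group(1)), match.group(2)
    millis = value * 1000 if unit == "s" else value
    if millis > MAX_BREAK_MS:
        raise ValueError(f"pause of {millis} ms exceeds the {MAX_BREAK_MS} ms limit")
    return f'<break time="{time}"/>'


print(break_tag("3s"))     # <break time="3s"/>
print(break_tag("500ms"))  # <break time="500ms"/>
```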
- The emphasis tag lets you change the rate and volume of the speech, making a word either louder and slower or quieter and faster. It is controlled by the level attribute, which takes the value strong, moderate, or reduced. A strong emphasis increases the volume and slows down the speaking rate, making the speech louder and slower. A moderate emphasis does the same but with less intensity. The reduced value decreases the volume and speeds up the rate, making the speech softer and faster.
An Example of the Emphasis Tag
<speak>
I already told you I
<emphasis level="strong">really like</emphasis>
that person.
</speak>
- The prosody tag modifies the pitch, rate, and volume of the tagged speech through its rate, pitch, and volume attributes. You can set the rate to very slow (x-slow), slow, medium, fast, or very fast (x-fast), raise or lower the pitch in the same manner, and change the volume with similar values.
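As with the other tags, a prosody element can be generated by a small helper that rejects unknown values. This Python sketch uses a hypothetical function name. The rate keywords are the ones listed above; the pitch and volume keywords are taken from the W3C SSML specification rather than from this article, and numeric forms such as percentages are deliberately left out.

```python
from xml.sax.saxutils import escape

RATES = {"x-slow", "slow", "medium", "fast", "x-fast"}
PITCHES = {"x-low", "low", "medium", "high", "x-high"}              # from the SSML spec
VOLUMES = {"silent", "x-soft", "soft", "medium", "loud", "x-loud"}  # from the SSML spec


def prosody(text, rate=None, pitch=None, volume=None):
    """Wrap text in a prosody tag using keyword values only."""
    attrs = []
    for name, value, allowed in (
        ("rate", rate, RATES),
        ("pitch", pitch, PITCHES),
        ("volume", volume, VOLUMES),
    ):
        if value is None:
            continue
        if value not in allowed:
            raise ValueError(f"unsupported {name} value: {value!r}")
        attrs.append(f'{name}="{value}"')
    if not attrs:
        raise ValueError("set at least one of rate, pitch, or volume")
    # escape() protects characters such as & and < in the spoken text.
    return f"<prosody {' '.join(attrs)}>{escape(text)}</prosody>"


print(prosody("I speak slowly and quietly.", rate="slow", volume="soft"))
# <prosody rate="slow" volume="soft">I speak slowly and quietly.</prosody>
```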
While the above examples are by no means a complete account of what you can do with SSML tags, nor do they cover all of the tags that are available, they do give you a good idea of what is possible. For a full look into SSML tags with Amazon Alexa, please visit Amazon’s Developer Page. If you are looking for how to work with SSML tags for Google Assistant devices, please visit Google’s Developer Reference Page.
So Why Use SSML in Voice Apps?
The Speech Synthesis Markup Language is significant in the realm of voice applications because it enables authors to create a better user experience for the customer. When a customer turns on their voice-enabled device and can have a human-like conversation, the result is a seamless interaction filled with understanding and emotion. This experience can only come with the use of SSML’s rich repository of tags.