SSMLBuilder
What is SSML?
Speech Synthesis Markup Language (SSML) is a markup language used to enhance synthesized speech output. It provides developers with a way to control various aspects of speech synthesis, such as pronunciation, intonation, volume, and speed, by using predefined tags and attributes.
SSML allows developers to create more natural and expressive speech output, making interactions with voice-based applications more engaging and user-friendly.
The following use cases show how SSML can improve the quality, expressiveness, and accessibility of synthesized speech output:
- Narration with Emphasis: In storytelling applications or audiobooks, developers can use SSML to emphasize specific words or phrases to convey emotions or highlight key points in the narrative. For example, during a suspenseful moment in a story, the narrator's voice could be slowed down for dramatic effect using SSML.
- Interactive Voice Responses: In voice-based applications such as virtual assistants or customer service bots, SSML can be used to provide more natural and engaging interactions with users. For instance, developers can use SSML to insert pauses between sentences to mimic natural speech patterns or adjust the pitch and volume of the voice to convey empathy or urgency.
- Educational Content: SSML can be valuable in educational applications where synthesized speech is used to deliver instructional content or quizzes. Developers can use SSML to modify the speaking rate to accommodate different learning paces or employ phonetic pronunciation tags to ensure correct pronunciation of specialized terms or foreign words.
- Accessibility Features: For applications designed to assist users with visual impairments or reading difficulties, SSML can play a crucial role in enhancing accessibility. Developers can use SSML to provide auditory cues, such as tone changes or speech emphasis, to indicate important information or user interface elements.
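To make the use cases above concrete, here is a small hand-written SSML fragment, checked for well-formedness with Python's standard library. The break, prosody, and emphasis tags are part of the W3C SSML specification; whether a particular TTS engine honours each of them depends on the engine, and the sentences are purely illustrative.

```python
import xml.etree.ElementTree as ET

# Illustrative SSML: a pause, a slowed-down phrase, and an emphasized word.
ssml = (
    "<speak>"
    "<s>The door creaked open.</s> "
    '<break time="500ms"/>'
    '<prosody rate="slow">Nobody was there.</prosody> '
    'Or so it <emphasis level="strong">seemed</emphasis>.'
    "</speak>"
)

# Parsing succeeds only if the markup is well-formed XML.
root = ET.fromstring(ssml)
tags_used = [element.tag for element in root.iter()]
print(tags_used)

# Joining the text nodes shows what is left once all markup is ignored.
plain_text = " ".join("".join(root.itertext()).split())
print(plain_text)
```

The second print shows roughly what an engine without SSML support would end up reading aloud.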
SSMLBuilder
The SSMLBuilder class simplifies the creation of SSML strings by providing intuitive methods to control various aspects of speech synthesis. It offers a range of methods for manipulating text, adjusting timing and prosody, specifying voice and phoneme characteristics, and more.
from ovos_utils.ssml import SSMLBuilder


class MySkill:
    def handle_intent(self, message):
        # Create an instance of SSMLBuilder
        ssml_builder = SSMLBuilder()
        # Generate SSML
        ssml_text = ssml_builder.sentence("Hello, world!").pause(500, "ms").say_slow("How are you today?").build()
        # Output:
        # '<speak>\n<s>Hello, world!</s> <break time=500ms/><prosody rate='0.4'>How are you today?</prosody>\n</speak>'
        # Speak the SSML text
        self.speak(ssml_text)
Text Manipulation
- sub(alias, word): Replaces a word with a specified alias.
- emphasis(level, word): Emphasizes a word with a specified level.
- parts_of_speech(word, role): Specifies the usage or role of a word.
- pause_by_strength(strength): Inserts a pause with a specified strength.
- sentence(text): Wraps text with <s> tags to denote a sentence.
- say_emphasis(text): Emphasizes the text strongly.
- say_strong(text): Modifies the vocal-tract length to increase speech intensity.
- say_weak(text): Modifies the vocal-tract length to decrease speech intensity.
- say_softly(text): Modifies the phonation to produce softer speech.
- say_auto_breaths(text): Adds automatic breaths to the speech.
- paragraph(text): Wraps text with <p> tags to denote a paragraph.
- audio(audio_file, text): Embeds audio with specified text.
Timing and Prosody
- pause(time, unit): Inserts a pause for a specified duration.
- prosody(attribute, text): Modifies prosodic attributes of the text such as pitch, rate, or volume.
- pitch(pitch, text): Changes the pitch of the text.
- volume(volume, text): Modifies the volume of the text.
- rate(rate, text): Adjusts the speaking rate of the text.
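The timing and prosody methods above can be pictured as thin wrappers around the break and prosody tags. The sketch below is a toy stand-in, not the ovos_utils implementation: the class name ToySSMLBuilder is hypothetical, the method names and argument order simply mirror the list above, and the exact markup produced by the real SSMLBuilder may differ.

```python
class ToySSMLBuilder:
    """Hypothetical stand-in used only to illustrate method chaining."""

    def __init__(self):
        self.parts = []

    def say(self, text):
        self.parts.append(text)
        return self  # returning self is what makes chaining possible

    def pause(self, time, unit):
        # Assumed markup: a standard SSML break tag with a quoted duration.
        self.parts.append(f'<break time="{time}{unit}"/>')
        return self

    def rate(self, rate, text):
        # Assumed markup: a standard SSML prosody tag with a rate attribute.
        self.parts.append(f'<prosody rate="{rate}">{text}</prosody>')
        return self

    def build(self):
        return "<speak>" + " ".join(self.parts) + "</speak>"


toy_ssml = (
    ToySSMLBuilder()
    .say("Please hold.")
    .pause(1, "s")
    .rate("slow", "Thank you for waiting.")
    .build()
)
print(toy_ssml)
```

Each method appends one fragment and returns the builder itself, which is why calls can be chained and finished with build().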
Voice and Phoneme
- say(text): Adds normal speed text to SSML.
- say_loud(text): Increases the volume of the text.
- say_slow(text): Slows down the speaking rate of the text.
- say_fast(text): Speeds up the speaking rate of the text.
- say_low_pitch(text): Lowers the pitch of the text.
- say_high_pitch(text): Raises the pitch of the text.
- say_whispered(text): Converts text into whispered speech.
- phoneme(ph, text): Specifies the phonetic pronunciation of the text.
- voice(voice, text): Specifies the voice to use for the text.
- whisper(text): Converts text into whispered speech.
Build and Utility
- build(): Constructs the final SSML string.
- remove_ssml(text): Removes SSML tags from the given text.
- extract_ssml_tags(text): Extracts SSML tags from the given text.
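Tag removal and extraction of the kind described above can be approximated with a single regular expression. The following is a minimal sketch under the assumption that anything between angle brackets counts as a tag; the function names extract_tags and strip_tags are hypothetical, and the real remove_ssml and extract_ssml_tags methods may behave differently (for example in how they treat whitespace).

```python
import re

# Assumed pattern: anything between angle brackets counts as a tag.
TAG_PATTERN = re.compile(r"<[^>]*>")


def extract_tags(text):
    """Return every tag found in the text, in order of appearance."""
    return TAG_PATTERN.findall(text)


def strip_tags(text):
    """Drop all tags and collapse the whitespace they leave behind."""
    return " ".join(TAG_PATTERN.sub(" ", text).split())


sample = '<speak><s>Hello, world!</s> <break time="500ms"/>How are you?</speak>'
print(extract_tags(sample))
print(strip_tags(sample))
```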
SSML Support in TTS Plugins
OVOS TTS plugins implement support for SSML, ensuring that SSML content is processed accurately during speech synthesis. Let's take a closer look at how SSML handling works within the TTS abstract class:
- SSML Validation: The validate_ssml() method checks if the TTS engine supports SSML. Unsupported or invalid SSML tags are removed from the input text to ensure proper processing.
- SSML Tag Handling: Supported SSML tags are processed by the TTS engine during synthesis. Unsupported tags are removed, while supported tags are modified or retained based on the implementation of the modify_tag() method.
# default handling of ssml, advanced plugins may override this method
def modify_tag(self, tag):
    """Override to modify each supported ssml tag.

    Arguments:
        tag (str): SSML tag to check and possibly transform.
    """
    return tag

def validate_ssml(self, utterance):
    """Check if engine supports ssml, if not remove all tags.

    Remove unsupported / invalid tags

    Arguments:
        utterance (str): Sentence to validate

    Returns:
        str: validated_sentence
    """
    # Validate speak tags
    if not self.ssml_tags or "speak" not in self.ssml_tags:
        self.format_speak_tags(utterance, False)
    elif self.ssml_tags and "speak" in self.ssml_tags:
        self.format_speak_tags(utterance)

    # if ssml is not supported by TTS engine remove all tags
    if not self.ssml_tags:
        return self.remove_ssml(utterance)

    # find ssml tags in string
    tags = SSML_TAGS.findall(utterance)

    for tag in tags:
        if any(supported in tag for supported in self.ssml_tags):
            utterance = utterance.replace(tag, self.modify_tag(tag))
        else:
            # remove unsupported tag
            utterance = utterance.replace(tag, "")

    # return text with supported ssml tags only,
    # collapsing double spaces left behind by removed tags
    return utterance.replace("  ", " ")
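The tag-filtering loop can be tried outside a plugin with a standalone sketch. It assumes SSML_TAGS is a regular expression matching anything between angle brackets (its definition is not shown above), and filter_ssml is a hypothetical free function that takes the supported tag list as an argument instead of reading self.ssml_tags.

```python
import re

# Assumption: the module-level SSML_TAGS pattern matches any <...> span.
SSML_TAGS = re.compile(r"<[^>]*>")


def filter_ssml(utterance, supported_tags, modify_tag=lambda tag: tag):
    """Keep (and optionally rewrite) supported tags, drop the rest."""
    for tag in SSML_TAGS.findall(utterance):
        if any(supported in tag for supported in supported_tags):
            utterance = utterance.replace(tag, modify_tag(tag))
        else:
            utterance = utterance.replace(tag, "")
    # Collapse double spaces left behind by removed tags.
    return utterance.replace("  ", " ")


text = '<speak>Wait<break time="1s"/> <emphasis level="strong">now</emphasis></speak>'
filtered = filter_ssml(text, ["speak", "break"])
print(filtered)

# The support check is a substring test, so a short name matches longer tags.
loose = filter_ssml("<emphasis>hi</emphasis>", ["s"])
print(loose)
```

Because the check is `supported in tag`, a one-letter entry such as "s" also matches tags like emphasis; this is worth keeping in mind when choosing which tag names a plugin lists as supported.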
Platform-Specific SSML Handling
Some TTS plugins, like the PollyTTS plugin, may support platform-specific SSML tags that are not part of the standard specification. For example, Amazon Polly supports additional SSML tags specific to Amazon's speech synthesis service.
Let's take a closer look at how SSML support is implemented in the PollyTTS plugin:
class PollyTTS(TTS):
    def __init__(self, *args, **kwargs):
        ssml_tags = ["speak", "say-as", "voice", "prosody", "break",
                     "emphasis", "sub", "lang", "phoneme", "w", "whisper",
                     "amazon:auto-breaths", "p", "s", "amazon:effect", "mark"]
        super().__init__(*args, **kwargs, audio_ext="mp3",
                         ssml_tags=ssml_tags, validator=PollyTTSValidator(self))

    def get_tts(self, sentence, wav_file, lang=None, voice=None):
        # SSML handling specific to PollyTTS
        # Replace custom SSML tags for Amazon Polly
        sentence = sentence.replace("\\whispered", "/amazon:effect") \
            .replace("whispered", "amazon:effect name=\"whispered\"")
        # alternatively the plugin could override self.modify_tag method instead
        # Synthesize speech using Amazon Polly API
        # Write audio stream to WAV file
        return wav_file, None
In this example, the PollyTTS plugin defines a list of supported SSML tags, including both standard and Amazon-specific tags. During initialization, the plugin sets up SSML support by passing this list to the TTS superclass.
When synthesizing speech using Amazon Polly, the plugin translates the generic whispered tag into the platform-specific amazon:effect tag to ensure compatibility with Amazon's speech synthesis service.
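The code comment in the example notes that a plugin could override modify_tag instead of rewriting the sentence in get_tts. A minimal sketch of that alternative follows, written as free functions so it runs on its own. It assumes the generic markup is spelled `<whispered>` and `</whispered>`; polly_modify_tag and translate are hypothetical names, and only the amazon:effect element with name="whispered" is taken from Amazon Polly's SSML.

```python
import re

# Assumption: any <...> span is treated as a tag.
SSML_TAGS = re.compile(r"<[^>]*>")


def polly_modify_tag(tag):
    """Map the generic whispered tag onto Polly's amazon:effect tag."""
    if tag == "<whispered>":
        return '<amazon:effect name="whispered">'
    if tag == "</whispered>":
        return "</amazon:effect>"
    return tag  # every other tag passes through unchanged


def translate(utterance):
    """Apply polly_modify_tag to each tag found in the utterance."""
    return SSML_TAGS.sub(lambda match: polly_modify_tag(match.group(0)), utterance)


generic = "<speak><whispered>It is a secret.</whispered></speak>"
polly_ssml = translate(generic)
print(polly_ssml)
```

Handling the opening and closing tags separately keeps the closing tag free of attributes, which plain string replacement on the word "whispered" does not guarantee.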
Behavior with Plugins That Do Not Support SSML
When SSML text is sent to a TTS plugin that does not support SSML, the tags are stripped (see validate_ssml() above) and the remaining text is processed as regular speech.
This means that any SSML-specific effects, such as pauses, emphasis, or prosody modifications, will be dropped, and the synthesized speech will be generated without considering the SSML markup.
It's important to ensure compatibility between the SSML content and the capabilities of the TTS plugin being used. If SSML-specific effects are essential for the intended speech output, it's recommended to verify that the selected TTS plugin supports SSML or consider using a different plugin that provides SSML support.
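One way to perform such a check is to compare the tag names used in an SSML string against the list of tags a plugin declares. The sketch below is an illustration only: unsupported_tags is a hypothetical helper, and supported_tags stands in for whatever list the plugin passes to the TTS superclass (such as ssml_tags in the PollyTTS example); how to read that list from a running plugin is not covered here.

```python
import re

# Captures the element name that follows "<" or "</".
TAG_NAME = re.compile(r"</?\s*([A-Za-z][\w:.-]*)")


def unsupported_tags(ssml, supported_tags):
    """Return the sorted tag names used in ssml but missing from supported_tags."""
    used = set(TAG_NAME.findall(ssml))
    return sorted(used - set(supported_tags))


ssml = '<speak>Wait<break time="1s"/><prosody rate="slow">please</prosody></speak>'

# A plugin that only declares speak and break would lose the prosody effect.
missing = unsupported_tags(ssml, ["speak", "break"])
print(missing)

# A plugin with no SSML support at all would lose every tag.
missing_all = unsupported_tags(ssml, [])
print(missing_all)
```

An empty result means every tag in the string is declared by the plugin; a non-empty result lists the effects that would be dropped.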