Grapheme to Phoneme Plugins

Grapheme to Phoneme is the process of converting text into a set of "sound units" called phonemes

In ovos-audio these plugins are used to auto generate mouth movements / visemes in the TTS stage.

They can also be used to help configuring wake words or to facilitate training of TTS systems

These plugins can provide phonemes either in ARPA or IPA alphabets, an automatic conversion will happen behind the scenes when needed

Visemes

Visemes are representations of the shape of a human mouth when speaking.

visemes

Mouth movements are generated via a mapping of ARPA to VISEMES, TTS plugins may provide this natively, or a G2P plugin may be used to estimate it directly from text

Visemes are predefined mouth positions, timing per phonemes is crucial for a natural mouth movement

The Mycroft Mark 1 uses this to make his "lips" match his speech.

OpenVoiceOS uses six basic visemes.

viseme mouth position
0 wide open
1 pursed
2 open
3 narrow lips
4 closed lips
5 parted lips
6 barely open lips

Mapping based on Jeffers phoneme to viseme map, seen in table 1, partially based on the "12 mouth shapes visuals seen here

List of G2P plugins

Plugin Type Duration
ovos-g2p-plugin-mimic ARPA ✔️
ovos-g2p-plugin-heuristic-arpa ARPA
ovos-g2p-plugin-espeak IPA
imagem neon-g2p-cmudict-plugin ARPA
imagem neon-g2p-phoneme-guesser-plugin ARPA
imagem neon-g2p-gruut-plugin IPA

Standalone Usage

All G2P plugins can be used as follows


utterance = "hello world"
word = "hello"
lang = "en-us"

plug = G2pPlugin()

# convert a word into a list of phonemes
phones = plug.get_ipa(word, lang)
assert phones == ['h', 'ʌ', 'l', 'oʊ']

phones = plug.get_arpa(word, lang)
assert phones == ['HH', 'AH', 'L', 'OW']

# convert a utterance into a list of phonemes
phones = plug.utterance2arpa(utterance, lang)
assert phones == ['HH', 'AH', 'L', 'OW', '.', 'W', 'ER', 'L', 'D']

phones = plug.utterance2ipa(utterance, lang)
assert phones == ['h', 'ʌ', 'l', 'oʊ', '.', 'w', 'ɝ', 'l', 'd']

# convert a utterance into a list of viseme, duration pairs
visemes = plug.utterance2visemes(utterance, lang)
assert visemes == [('0', 0.0775), ('0', 0.155), ('3', 0.2325), ('2', 0.31), ('2', 0.434), ('2', 0.558), ('3', 0.682),
                   ('3', 0.806)]

Plugin Template

from ovos_plugin_manager.templates.g2p import Grapheme2PhonemePlugin
from ovos_utils.lang.visimes import VISIMES


# base plugin class
class MyARPAG2PPlugin(Grapheme2PhonemePlugin):
    def __init__(self, config=None):
        self.config = config or {}

    def get_arpa(self, word, lang, ignore_oov=False):
        phones = []  # TODO implement
        return phones

    def get_durations(self, utterance, lang="en", default_dur=0.4):
        words = utterance.split()
        phones = [self.get_arpa(w, lang) for w in utterance.split()]
        dur = default_dur  # TODO this is plugin specific
        return [(pho, dur) for pho in phones]

    def utterance2visemes(self, utterance, lang="en", default_dur=0.4):
        phonemes = self.get_durations(utterance, lang, default_dur)
        return [(VISIMES.get(pho[0].lower(), '4'), float(pho[1]))
                for pho in phonemes]

If your plugin uses IPA instead of ARPA simply replace get_arpa with get_ipa

from ovos_plugin_manager.templates.g2p import Grapheme2PhonemePlugin
from ovos_utils.lang.visimes import VISIMES


# base plugin class
class MyIPAG2PPlugin(Grapheme2PhonemePlugin):
    def __init__(self, config=None):
        self.config = config or {}

    def get_ipa(self, word, lang, ignore_oov=False):
        phones = []  # TODO implement
        return phones

    def get_durations(self, utterance, lang="en", default_dur=0.4):
        # auto converted to arpa if ipa is implemented
        phones = [self.get_arpa(w, lang) for w in utterance.split()]
        dur = default_dur  # TODO this is plugin specific
        return [(pho, dur) for pho in phones]

    def utterance2visemes(self, utterance, lang="en", default_dur=0.4):
        phonemes = self.get_durations(utterance, lang, default_dur)
        return [(VISIMES.get(pho[0].lower(), '4'), float(pho[1]))
                for pho in phonemes]