Speak Sentence

Text Content

| Name | Description |
|------|-------------|
| text | The text to speak. Cannot be blank. Can be a mixture of plain text and SSML tags (see below for the list of supported tags). |

Attributes

Attributes of the speaker may be changed using these values. The default speaker is a female speaker with locale en_US.

To change the gender or locale of the speaker, change the appropriate attributes and a new voice will be selected that matches the new gender and locale.

To choose a specific voice by name, use the voice attribute.

| Attribute | Description |
|-----------|-------------|
| voice | Selects the voice of the speaker. Consult the voice column in the table below for valid values. If the voice attribute is present, gender and locale are ignored. |
| gender | Selects the gender of the speaker. Valid values are "male" or "female". Default "female". If the chosen gender does not exist for the locale, the opposite gender is used. |
| locale | Selects the locale of the speaker. Consult the locale column in the table below for valid values. Default "en_US". |
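
As an illustration of the attribute rules above, the following sketch selects a speaker first by gender and locale, then by name. The spoken text is made up for the example, and the comments about which voice is picked are assumptions based on the Supported Voices table rather than a guaranteed result.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <!-- No voice attribute: a voice matching gender and locale is selected
         (per the voice table, this should be "simon"). -->
    <SpeakSentence gender="male" locale="en_UK">
        Your appointment is confirmed.
    </SpeakSentence>
    <!-- voice is present, so gender and locale are ignored and "julie" speaks. -->
    <SpeakSentence voice="julie" gender="male" locale="en_UK">
        Thank you for calling.
    </SpeakSentence>
</Response>
```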

Supported Voices

The table below lists our supported voices and maps each voice name to the corresponding voice of our current TTS provider (Amazon Polly). To provide our customers with the best possible experience, we reserve the right to change TTS providers in the future.

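| voice | locale | gender | Provider Name (AWS Polly) | Voice Type |
|-------|--------|--------|---------------------------|------------|
| julie | en_US | female | Joanna | Standard |
| kate | en_US | female | Kendra | Standard |
| susan | en_US | female | Kimberly | Standard |
| dave | en_US | male | Matthew | Standard |
| paul | en_US | male | Matthew | Standard |
| bridget | en_UK | female | Amy | Standard |
| simon | en_UK | male | Brian | Standard |
| katrin | de | female | Marlene | Standard |
| stefan | de | male | Hans | Standard |
| esperanza | es | female | Conchita | Standard |
| violeta | es | female | Lucia | Standard |
| jorge | es | male | Enrique | Standard |
| rosa | es_MX | female | Mia | Standard |
| jolie | fr | female | Celine | Standard |
| bernard | fr | male | Mathieu | Standard |
| paola | it | female | Carla | Standard |
| luca | it | male | Giorgio | Standard |
| masako | ja | female | Mizuki | Standard |
| kenji | ja | male | Takumi | Standard |
| nadiya | ru | female | Tatyana | Standard |
| anatoli | ru | male | Maxim | Standard |
| zeina | arb | female | Zeina | Standard |
| zhiyu | cmn-CN | female | Zhiyu | Standard |
| ruth | en_US | female | Ruth | Enhanced* |
| stephen | en_US | male | Stephen | Enhanced* |
| lupe | es_US | female | Lupe | Enhanced* |
| pedro | es_US | male | Pedro | Enhanced* |
| gabrielle | fr_CA | female | Gabrielle | Enhanced* |
| liam | fr_CA | male | Liam | Enhanced* |
| salli | en_US | female | Salli | Standard |
| salli_enh | en_US | female | Salli | Enhanced* |
| chantal | fr_CA | female | Chantal | Standard |
| miguel | es_US | male | Miguel | Standard |
| joey | en_US | male | Joey | Standard |
| joey_enh | en_US | male | Joey | Enhanced* |
| penelope | es_US | female | Penelope | Standard |
| russell | es_AU | male | Russell | Standard |
| emma | en_GB | female | Emma | Standard |
| emma_enh | en_GB | female | Emma | Enhanced* |
| nicole | en_AU | female | Nicole | Standard |
| raveena | en_IN | female | Raveena | Standard |
| mads | da_DK | male | Mads | Standard |
| justin | en_US | male | Justin | Enhanced* |
| ivy | en_US | female | Ivy | Standard |
| ivy_enh | en_US | female | Ivy | Enhanced* |
| carmen | ro_RO | female | Carmen | Standard |
| naja | da_DK | female | Naja | Standard |
| ruben | nl_NL | male | Ruben | Standard |
| geraint | en_GB_WLS | male | Geraint | Standard |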

* “Enhanced” voices are based on Amazon Polly Neural voices, which provide more natural-sounding speech but incur additional costs.
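
For illustration, an Enhanced voice is requested by name like any other voice. This is a minimal sketch using the ruth voice from the table; the sentence is invented for the example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <!-- "ruth" is listed as an Enhanced voice, so additional costs apply. -->
    <SpeakSentence voice="ruth">
        Welcome back. How can we help you today?
    </SpeakSentence>
</Response>
```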

Supported SSML Tags

The following SSML tags are supported. Full details about SSML tags can be found at: https://www.w3.org/TR/2010/REC-speech-synthesis11-20100907/

<break>

Adds a pause to the speech. You can specify the duration of the pause by using either the strength or the time attributes.

Attributes:

  • strength: (optional) accepted values are: none, x-weak, weak, medium (default), strong or x-strong
  • time: (optional) the duration of the pause in seconds or milliseconds (e.g. 1s or 1000ms) with a maximum value of 10 seconds
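
A minimal sketch of both attributes in use, assuming a Standard voice; the spoken text and pause lengths are illustrative only.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <SpeakSentence voice="julie">
        Your verification code is 4 2 7.
        <!-- Explicit duration: half a second. -->
        <break time="500ms"/>
        Once again,
        <!-- Relative strength instead of a fixed duration. -->
        <break strength="strong"/>
        4 2 7.
    </SpeakSentence>
</Response>
```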

<emphasis>

The contained text is spoken with emphasis.

Attributes:

  • level: (optional) defines the strength of emphasis to be applied, accepted values are: strong, moderate or reduced
Note: <emphasis> is supported only by Standard voices. It is not supported by Neural voices.
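
A short sketch of the level attribute, using a Standard voice because Neural voices do not support this tag; the sentence is made up for the example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <SpeakSentence voice="julie">
        Please do <emphasis level="strong">not</emphasis> hang up.
        Your call is <emphasis level="moderate">important</emphasis> to us.
    </SpeakSentence>
</Response>
```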

<lang>

Specifies the natural language of the content. If you use a voice with an en_US locale and an es-MX lang, it will sound like an English-speaking American attempting to speak Spanish.

Attributes:

  • xml:lang: specifies the language, accepted values are:
    • arb: Arabic
    • cmn-CN: Chinese, Mandarin
    • da-DK: Danish
    • nl-NL: Dutch
    • en-AU: English, Australian
    • en-GB: English, British
    • en-IN: English, Indian
    • en-US: English, US
    • fr-FR: French
    • fr-CA: French, Canadian
    • hi-IN: Hindi
    • de-DE: German
    • is-IS: Icelandic
    • it-IT: Italian
    • ja-JP: Japanese
    • ko-KR: Korean
    • nb-NO: Norwegian
    • pl-PL: Polish
    • pt-BR: Portuguese, Brazilian
    • pt-PT: Portuguese, European
    • ro-RO: Romanian
    • ru-RU: Russian
    • es-ES: Spanish, European
    • es-MX: Spanish, Mexican
    • es-US: Spanish, US
    • sv-SE: Swedish
    • tr-TR: Turkish
    • cy-GB: Welsh
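
As a brief sketch, the example below has an en_US voice pronounce one French word; the phrase is illustrative, and the resulting accent depends on the voice chosen.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <SpeakSentence voice="julie">
        The French word for thank you is <lang xml:lang="fr-FR">merci</lang>.
    </SpeakSentence>
</Response>
```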

<p>

Adds a pause between paragraphs.

<phoneme>

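Uses a phonetic pronunciation for the contained text.

Attributes:

  • alphabet: (optional) the phonetic alphabet in which ph is written (e.g. ipa), as defined in the SSML specification
  • ph: the phonetic symbols to pronounce, as defined in the SSML specification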
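
The sketch below assumes the alphabet and ph attributes defined for <phoneme> in the W3C SSML specification, with an IPA transcription; whether a particular voice honors a given transcription is not confirmed here.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <SpeakSentence voice="julie">
        <!-- ph holds the pronunciation; the element text is what it replaces. -->
        You say <phoneme alphabet="ipa" ph="pɪˈkɑːn">pecan</phoneme>.
        I say <phoneme alphabet="ipa" ph="ˈpi.kæn">pecan</phoneme>.
    </SpeakSentence>
</Response>
```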

<prosody>

Controls the volume, rate and pitch of the speech.

Attributes:

  • pitch: (optional) changes the pitch, accepted values: x-low, low, medium, high, x-high, default or a relative change in % (e.g. -15% or 20%)
  • rate: (optional) changes the speaking rate, accepted values: x-slow, slow, medium, fast, x-fast or any positive percentage (e.g. 50% for a speaking rate of half the default rate or 200% for a speaking rate twice the default rate)
  • volume: (optional) changes the volume, accepted values: silent, x-soft, soft, medium, loud, x-loud, default or the volume in dB (e.g. +1dB or -6dB)
Note: <prosody> attributes are fully supported by the Standard voices. Neural voices support the volume and rate attributes, but don't support the pitch attribute.
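
A minimal sketch combining the three attributes, using a Standard voice so that pitch applies; the values and text are illustrative.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <SpeakSentence voice="julie">
        <!-- Named values for rate and volume. -->
        <prosody rate="slow" volume="loud">Please listen carefully.</prosody>
        <!-- Relative pitch change and a volume offset in dB. -->
        <prosody pitch="-15%" volume="-6dB">This part is spoken lower and quieter.</prosody>
    </SpeakSentence>
</Response>
```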

<s>

Adds a pause between lines or sentences.
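
The <p> and <s> tags can be nested to mark paragraphs and the sentences inside them. A short sketch, with invented text:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <SpeakSentence voice="julie">
        <p>
            <s>Thank you for calling.</s>
            <s>Our office is currently closed.</s>
        </p>
        <p>
            <s>Please call back tomorrow.</s>
        </p>
    </SpeakSentence>
</Response>
```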

<say-as>

Indicates how to interpret the text.
More information at: https://www.w3.org/TR/2005/NOTE-ssml-sayas-20050526/

Attributes:

  • interpret-as: accepted values:
    • date: the contained text is a Gregorian calendar date, must specify the format attribute, see below
    • time: the contained text is a time in minutes and seconds (e.g. 1'20")
    • telephone: the contained text is a 7-digit or 10-digit telephone number (e.g. 2025551212)
    • characters: enclosed text should be spoken as a series of alpha-numeric characters
    • cardinal: the enclosed text is an integral or decimal number and should be spoken as a cardinal number
    • ordinal: the enclosed text is an integral number and should be spoken as an ordinal number
  • format: (optional) used with interpret-as="date" to specify the date format, accepted values:
    • mdy: Month-day-year
    • dmy: Day-month-year
    • ymd: Year-month-day
    • md: Month-day
    • dm: Day-month
    • ym: Year-month
    • my: Month-year
    • d: Day
    • m: Month
    • y: Year
Note: Neural voices do not currently support <say-as> with interpret-as="characters". If your SSML uses this value and is processed with a neural voice, the affected text is synthesized with a standard voice instead. Even though a standard voice is used for that part of the synthesis, billing is still based on the neural voice.
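
The sketch below shows a few interpret-as values with a Standard voice. The date, ticket position, and surrounding text are illustrative; the telephone number is the one used in the list above.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <SpeakSentence voice="julie">
        Your order ships on
        <!-- format is required when interpret-as is "date". -->
        <say-as interpret-as="date" format="mdy">09-15-2025</say-as>.
        You are <say-as interpret-as="ordinal">3</say-as> in line.
        To reach us, call <say-as interpret-as="telephone">2025551212</say-as>.
    </SpeakSentence>
</Response>
```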

<sub>

The text in the alias attribute replaces the contained text for pronunciation.

Attributes:

  • alias: the string to be spoken in place of the contained text
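
A short sketch: the abbreviation is written in the text, and the alias is what gets spoken. The sentence is made up for the example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <SpeakSentence voice="julie">
        The standard is published by the
        <sub alias="World Wide Web Consortium">W3C</sub>.
    </SpeakSentence>
</Response>
```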

Webhooks Received

No webhooks are received after the <SpeakSentence> verb is executed.

Examples

Speak Sentence

<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <SpeakSentence voice="julie">
        This is a test.
    </SpeakSentence>
</Response>

Speak Sentence with SSML

<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <SpeakSentence voice="jorge">
        Hello, you have reached the home of <lang xml:lang="es-MX">Antonio Mendoza</lang>.
        Please leave a message.
    </SpeakSentence>
</Response>