How to Play Media and Use Text-to-Speech

This guide shows you how to play media and use text-to-speech on calls. Before you start, make sure you have followed our earlier guide on how to make an outbound call with Bandwidth.

You may want to play media for on-hold music or use text-to-speech to play descriptive messages to your customers.

Play Media

The BXML <PlayAudio> verb plays an audio file on the call. You can play multiple audio files in succession by using several <PlayAudio> verbs. Each audio file must already be hosted, and its URL goes in the body of the <PlayAudio> tag.

<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <SpeakSentence>Hello! Here is a sponsored message.</SpeakSentence>
</Response>

Once the call is created using our API, we check the specified answerUrl for a BXML response, which tells us to play the media file.

In this example, two audio files are played for the caller: one from an absolute URL hosted somewhere other than the application, and one from a relative URL. The relative URL assumes the application has an endpoint that serves an audio file.
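
The snippet shown earlier does not include the two <PlayAudio> verbs described here. A minimal sketch of what such a response might look like follows. Both URLs are placeholders chosen for illustration, not real Bandwidth assets, and the relative path is a hypothetical route that your application would need to serve.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <!-- Absolute URL: placeholder host, hosted outside the application -->
    <PlayAudio>https://example.com/audio/greeting.wav</PlayAudio>
    <!-- Relative URL: hypothetical route, assumes the application serves a file here -->
    <PlayAudio>/audio/hold-music.mp3</PlayAudio>
</Response>
```

The files play in succession, in the order the verbs appear in the response.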


Note: Only .wav and .mp3 files are supported. To ensure playback quality, Bandwidth recommends limiting audio files to less than 1 hour in length or 250 MB in size.


Text-to-Speech

The <SpeakSentence> verb plays text-to-speech on a call. You can change attributes of the speaker, including gender and locale. The default speaker, susan, is a female voice with the locale en_US. See the Bandwidth documentation for all supported speakers.

<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <SpeakSentence voice="bridget">
        Hello <lang xml:lang="en-GB">Sherlock Holmes</lang>.
        You have an appointment on <say-as interpret-as="date" format="mdy">11/12/2022</say-as>.
    </SpeakSentence>
</Response>

In this example, once the call is created using our API, we check the specified answerUrl for a BXML response, which tells us to play back the given text.

Speech Synthesis Markup Language (SSML) is an XML-based markup language that gives you finer control over how synthesized speech is generated. Here, the name Sherlock Holmes is spoken with a British inflection, and the date is pronounced as "November twelfth, twenty-twenty-two" instead of being read out as numbers. See the Bandwidth documentation for all supported SSML tags.
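
As a sketch of the speaker attributes mentioned earlier, the response below selects a speaker by gender and locale rather than by name. It assumes that gender and locale are accepted as attributes on <SpeakSentence> and that a male en_US speaker is available. Check the supported speakers list before relying on either, and note that the spoken text is invented for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <!-- Assumed attributes: gender and locale choose the speaker instead of voice -->
    <SpeakSentence gender="male" locale="en_US">
        Thank you for calling. Please hold while we connect you.
    </SpeakSentence>
</Response>
```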

Where to next?

Now that you have made your first outbound call that plays media or uses text-to-speech, you can explore more of the available actions in the following guides: