Give Your App Machine Learning Powered Text-To-Speech with AWS Polly


If your application needs to convert text to speech programmatically to interact with users, AWS has a managed service that uses machine learning to create lifelike, believable voices that can significantly improve your user experience.

Neural Based Text-to-Speech Is So Much Better

We can’t overstate this: neural text-to-speech (TTS) sounds fluid and human, much like Siri or Alexa, while standard TTS sounds robotic in comparison (though, admittedly, still quite acceptable).

You really have to hear it for yourself. Listen to this example using standard TTS.

Now listen to this example using neural TTS. Hear the difference? The transitions between words are much smoother than what can be achieved programmatically. Which one do you want to put in front of users?

With Polly, robotic TTS is a thing of the past. Like most AWS services, you’re charged based on usage. The going rate for neural TTS is $16 per million characters of text. If you’re building a conversational application, the responses will usually be fairly short, which cuts down on cost.

AWS Polly pricing chart

AWS Polly also supports standard TTS, which costs a quarter as much and also serves as a fallback for languages that don’t have neural support yet. It’s still quite good, though not on the level of the neural engine.
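To make the pricing concrete, here is a rough Python sketch of the arithmetic. It assumes the $16 per million characters neural rate quoted above and a standard rate of one quarter of that. The constant and function names are illustrative, and the rates should be checked against the current pricing page before you rely on them.

```python
# Rates per one million characters, taken from the figures quoted in this
# article. They are assumptions for illustration, not live pricing.
NEURAL_USD_PER_MILLION = 16.0
STANDARD_USD_PER_MILLION = NEURAL_USD_PER_MILLION / 4  # "four times cheaper"


def estimate_cost_usd(characters: int, engine: str = "neural") -> float:
    """Return the approximate cost in USD of synthesizing `characters` characters."""
    rates = {
        "neural": NEURAL_USD_PER_MILLION,
        "standard": STANDARD_USD_PER_MILLION,
    }
    if engine not in rates:
        raise ValueError(f"unknown engine: {engine!r}")
    if characters < 0:
        raise ValueError("characters must be non-negative")
    return characters * rates[engine] / 1_000_000
```

Under these assumed rates, 10,000 replies of about 200 characters each add up to 2,000,000 characters, which comes to roughly $32 on the neural engine and $8 on the standard one.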

You can also provide Polly with custom lexicons, which let you change the pronunciation of certain words to customize the response you get, or to fix errors made by the text-to-speech engine. You can also use Speech Synthesis Markup Language (SSML) as input, which gives you fine control over the output.
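As a small illustration of SSML input, the Python sketch below wraps a list of sentences in a `<speak>` element and places a fixed `<break>` pause between them. The helper name `build_ssml` and the default pause length are assumptions made for this example; the text is XML-escaped so characters such as `&` don't break the markup.

```python
from xml.sax.saxutils import escape


def build_ssml(sentences: list[str], pause_ms: int = 400) -> str:
    """Wrap sentences in <speak> and separate them with a fixed pause."""
    pause = f'<break time="{pause_ms}ms"/>'
    # Escape &, < and > so plain text cannot be mistaken for markup.
    body = pause.join(escape(sentence) for sentence in sentences)
    return f"<speak>{body}</speak>"


print(build_ssml(["Hello.", "Fish & chips?"], pause_ms=250))
# <speak>Hello.<break time="250ms"/>Fish &amp; chips?</speak>
```

When sending a string like this through the CLI, the text type has to be set to SSML (the `--text-type ssml` option) so that the tags are interpreted instead of being read aloud.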

To get started, head over to the Polly Console. This service is extremely simple—just give Polly the text you want to convert, select a language, and select the voice you wish to use. You can press the “Listen To Speech” button to preview the results:

AWS polly console

You can download the file as an MP3 from here, or save it to S3. If you’re converting more than 3,000 characters, you’ll have to save the output to S3 instead of downloading it directly.

Of course, using a service like this from the console isn’t that useful. You’re far more likely to want to access it programmatically using the AWS API or the CLI. We’ll cover the CLI here, but you can read the API documentation for Polly for reference on how to set that up.

The aws polly command contains all of the controls for working with Polly. You can get a list of all supported voices with describe-voices, which you’ll likely want to pass to jq:

aws polly describe-voices | jq '.Voices'
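If jq isn't available, or you want to filter voices inside an application, the same JSON can be handled in Python. The sketch below assumes the output of `aws polly describe-voices` has already been parsed into a dict and that each voice entry carries `Id`, `LanguageCode`, and `SupportedEngines` fields. The function name and the sample data are made up for illustration.

```python
def voices_for(response: dict, language_code: str, engine: str = "neural") -> list[str]:
    """Return sorted voice IDs that match a language and support an engine."""
    return sorted(
        voice["Id"]
        for voice in response.get("Voices", [])
        if voice.get("LanguageCode") == language_code
        and engine in voice.get("SupportedEngines", [])
    )


# Illustrative sample shaped like the describe-voices output (not real data).
sample = {
    "Voices": [
        {"Id": "Joanna", "LanguageCode": "en-US",
         "SupportedEngines": ["neural", "standard"]},
        {"Id": "ExampleStandardOnly", "LanguageCode": "en-US",
         "SupportedEngines": ["standard"]},
        {"Id": "ExampleOther", "LanguageCode": "de-DE",
         "SupportedEngines": ["neural"]},
    ]
}
print(voices_for(sample, "en-US"))  # ['Joanna']
```

Filtering on the engine up front is one way to avoid requesting a neural voice for a language that only has standard support.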

The synthesize-speech command will convert text, given a few options:

aws polly synthesize-speech \
    --output-format mp3 \
    --voice-id Joanna \
    --text 'Text to read' \
    example.mp3
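For applications, the same request can be made through an AWS SDK. Below is a minimal Python sketch that assumes the boto3 SDK, whose Polly client exposes `synthesize_speech` and returns the audio under an `AudioStream` key. The helper name `synthesize_to_file` is hypothetical, and the client is passed in as an argument so the function can be exercised without AWS credentials.

```python
def synthesize_to_file(client, text: str, path: str,
                       voice_id: str = "Joanna",
                       output_format: str = "mp3") -> int:
    """Synthesize `text` and write the audio to `path`; return bytes written."""
    response = client.synthesize_speech(
        OutputFormat=output_format,
        VoiceId=voice_id,
        Text=text,
    )
    # The audio arrives as a stream-like object; read it fully, then save it.
    audio = response["AudioStream"].read()
    with open(path, "wb") as handle:
        handle.write(audio)
    return len(audio)


# Usage, assuming boto3 is installed and AWS credentials are configured:
#   import boto3
#   synthesize_to_file(boto3.client("polly"), "Text to read", "example.mp3")
```

Like the CLI command above, this sketch does not set an engine, so the service picks its default. Passing an `Engine` argument to `synthesize_speech` would be the place to request the neural engine.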

This downloads the MP3 locally. If you want to create an asynchronous task that writes its output to S3, use start-speech-synthesis-task:

aws polly start-speech-synthesis-task \
    --engine neural \
    --region us-west-1 \
    --endpoint-url "" \
    --output-format mp3 \
    --output-s3-bucket-name your-bucket-name \
    --output-s3-key-prefix optional/prefix/path/file \
    --voice-id Joanna \
    --text file://text_file.txt

This reads the input from a text file on disk and writes the output to the bucket you specify, optionally under a key prefix.
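Because the task runs asynchronously, an application usually has to wait for it to finish before using the audio. The Python sketch below assumes the boto3 Polly client methods `start_speech_synthesis_task` and `get_speech_synthesis_task`, and polls until the task reports `completed` or `failed`. The helper name, polling interval, and retry limit are illustrative choices, and the client is passed in so the logic can be tried without AWS access.

```python
import time


def run_synthesis_task(client, text: str, bucket: str,
                       voice_id: str = "Joanna", engine: str = "neural",
                       poll_seconds: float = 5.0, max_polls: int = 60) -> str:
    """Start a synthesis task, wait for it, and return the output URI."""
    task = client.start_speech_synthesis_task(
        Engine=engine,
        OutputFormat="mp3",
        OutputS3BucketName=bucket,
        VoiceId=voice_id,
        Text=text,
    )["SynthesisTask"]

    polls = 0
    while True:
        status = task["TaskStatus"]
        if status == "completed":
            return task["OutputUri"]
        if status == "failed":
            raise RuntimeError(task.get("TaskStatusReason", "synthesis task failed"))
        if polls >= max_polls:
            raise TimeoutError("synthesis task did not finish in time")
        # Still scheduled or in progress: wait, then ask for the latest state.
        time.sleep(poll_seconds)
        task = client.get_speech_synthesis_task(TaskId=task["TaskId"])["SynthesisTask"]
        polls += 1


# Usage, assuming boto3 is installed and AWS credentials are configured:
#   import boto3
#   uri = run_synthesis_task(boto3.client("polly"), "Long text", "your-bucket-name")
```

Fixed-interval polling keeps the sketch short. For production use, a notification-based approach may be a better fit than a loop, depending on how long your texts are.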

If you’re thinking of using Polly to build a chatbot, you may want to look into AWS Lex, a managed chatbot service that uses Polly for speech synthesis.


Author: admin
