Swift Text-To-Speech tool As Deep As Possible

Myrick Chow · ITNEXT · Feb 14, 2019


Have you ever thought about converting text to speech so that blind users can use your app? Apple provides the excellent VoiceOver feature in both iOS and macOS, which helps blind users by giving audio feedback about the object currently in focus. Luckily, Apple has also opened up the Text-To-Speech feature so that developers can implement a VoiceOver-like experience. The API is AVSpeechSynthesizer, which belongs to the AVFoundation framework. This article goes through the details and pitfalls of working with this API.

Let’s get a brief picture of the whole article first:

AVSpeechSynthesizer

AVSpeechSynthesizer is:

  1. Under AVFoundation framework
  2. Available since iOS 7.0

It provides the speech-related functions:

  1. Play
  2. Pause
  3. Continue (Only for a paused speech)
  4. Stop

Important properties:

  1. A queue of AVSpeechUtterance objects, each containing a set of speech parameters:
    1.1 Text
    1.2 Voice
    1.3 Rate
    1.4 Pitch multiplier
    1.5 Volume (better thought of as “relative volume”)
    1.6 Pre-utterance delay
    1.7 Post-utterance delay
  2. An AVSpeechSynthesizerDelegate object with 6 speech callbacks:
    2.1 didStart
    2.2 didFinish
    2.3 didPause
    2.4 didContinue
    2.5 didCancel
    2.6 willSpeakRangeOfSpeechString

AVSpeechUtterance

AVSpeechUtterance stores the 7 speech parameters listed in the section above. It is finally passed to an AVSpeechSynthesizer to produce speech. See the sketch below:
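A minimal sketch (assuming a simple one-off utterance; the variable names are illustrative):

import AVFoundation

// Keep a strong reference to the synthesizer (e.g. as a property),
// otherwise it may be deallocated before the speech finishes.
let synthesizer = AVSpeechSynthesizer()

// An utterance wraps the text plus all speech parameters.
let utterance = AVSpeechUtterance(string: "Hello world")
utterance.voice = AVSpeechSynthesisVoice(language: "en-US")

// The utterance is enqueued and spoken asynchronously.
synthesizer.speak(utterance)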

Note 1:
When multiple AVSpeechUtterance instances are passed to an AVSpeechSynthesizer, they are not spoken at the same time; they are queued and spoken one by one.

Note 2:
The audio channel in iOS allows only one source at a time. When one AVSpeechSynthesizer instance is outputting voice, no other AVSpeechSynthesizer instance can interrupt it. speak() requests from other instances are silently ignored, with no runtime error or warning. Developers should bear this in mind.
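One way to avoid this pitfall, sketched under the assumption that all speech in the app can go through a single object, is to wrap one shared AVSpeechSynthesizer (the Speaker class below is a hypothetical helper, not an Apple API):

import AVFoundation

// Hypothetical helper: route all speech through a single shared synthesizer
// so that no second instance ever competes for the audio channel.
final class Speaker {
    static let shared = Speaker()
    private let synthesizer = AVSpeechSynthesizer()

    func speak(_ text: String, language: String = "en-US") {
        let utterance = AVSpeechUtterance(string: text)
        utterance.voice = AVSpeechSynthesisVoice(language: language)
        synthesizer.speak(utterance)   // queued behind any speech already playing
    }
}

// Usage: Speaker.shared.speak("Hello world")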

AVSpeechUtterance parameter 1: Text

The AVFoundation framework is quite powerful and can recognise many different text formats. See below for some example texts and their corresponding pronunciations:

  1. Plain text — “Hello world” : “Hello world”
  2. Emoji — “❤️” : “Red heart”
    * It is supposed to have the same pronunciation as the name stated by The Unicode Consortium or Emojipedia. However, NOT EVERY emoji pronunciation matches the official name. For example, “😀” is pronounced as “Grinning Face With Normal Eyes” in iOS but is named “Grinning Face” on Emojipedia. This is discussed further in the later part of this article.
  3. Prefix unit (dollar) — “$100.99” : “One hundred dollars and ninety nine cents”
  4. Postfix unit (cm) — “10cm” : “Ten centimeters”
  5. Unrecognised words — “hdbjfcng” : “H-D-B-J-F-C-N-G”

This greatly reduces the developer’s burden of parsing special characters or phrases into understandable speech.
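As a quick illustration, the sample texts above could be queued on one synthesizer (a sketch; the actual pronunciation depends on the selected voice):

import AVFoundation

let synthesizer = AVSpeechSynthesizer()
let samples = ["Hello world", "❤️", "$100.99", "10cm", "hdbjfcng"]

for text in samples {
    let utterance = AVSpeechUtterance(string: text)
    utterance.voice = AVSpeechSynthesisVoice(language: "en-US")
    synthesizer.speak(utterance)   // queued and spoken one by one
}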

AVSpeechUtterance parameter 2: Voice

There are many languages in the world, and each has accent variations across countries. Take English (en) as an example: there are 5 supported English accents in iOS. They are:

  1. British English (en-GB)
  2. American English (en-US)
  3. Australian English (en-AU)
  4. Irish English (en-IE)
  5. South African English (en-ZA)

Apple ships 52 default voice tracks in iOS, all in “Compact” mode. Users can manually upgrade a voice track to “Enhanced” mode for better voice quality. See the image below for the steps to view (red) and upgrade (green) voices:

Procedures of viewing and upgrading voices in iOS Settings

There are two ways to initialise an AVSpeechSynthesisVoice object for the voice parameter of AVSpeechUtterance:

  1. By voice identifier (self-explanatory)
    The list of available voice identifiers can be retrieved with AVSpeechSynthesisVoice.speechVoices(). See the sketch below:
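This sketch prints every installed voice and then picks one by identifier; the identifier string is only an example and the exact values vary by device and iOS version:

import AVFoundation

// Print every installed voice so a suitable identifier can be picked.
for voice in AVSpeechSynthesisVoice.speechVoices() {
    print(voice.identifier,
          voice.language,
          voice.name,
          voice.quality == .enhanced ? "Enhanced" : "Compact")
}

let utterance = AVSpeechUtterance(string: "Hello world")
// Example identifier for the Australian English "Karen" voice on one test device.
utterance.voice = AVSpeechSynthesisVoice(identifier: "com.apple.ttsbundle.Karen-compact")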

2. By voice language code

Language code can be in either short form or long form.

For example, English:

  • Short form: “en”
  • Long form: “en-AU”, “en-GB”, “en-IE”, “en-US”, “en-ZA”

The initialised AVSpeechSynthesisVoice varies with 2 factors:

  • Type of language code
  • Whether the user has set a preference for that specific accent

Factor 1 — Short form / Long form language code:

The long form “en-AU” is straightforward: it points to the Australian English accent.

However, “en” represents the broad category “English” and is thus ambiguous. Other examples are “es” (Spanish), “fr” (French), “nl” (Dutch), “pt” (Portuguese) and “zh” (Chinese). Which specific accent do they refer to? It depends on how Apple defines them:

  • “en” refers to “en-US”
  • “es” refers to “es-ES”
  • “fr” refers to “fr-FR”
  • “nl” refers to “nl-NL”
  • “pt” refers to “pt-PT”
  • “zh” refers to “zh-CN”
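A short sketch of both forms (the resolved accents follow the mapping above):

import AVFoundation

// Long form: unambiguous, points to a specific accent.
let australian = AVSpeechSynthesisVoice(language: "en-AU")

// Short form: Apple resolves the ambiguity itself.
let english = AVSpeechSynthesisVoice(language: "en")
let chinese = AVSpeechSynthesisVoice(language: "zh")

print(australian?.language ?? "n/a")   // "en-AU"
print(english?.language ?? "n/a")      // expected: "en-US"
print(chinese?.language ?? "n/a")      // expected: "zh-CN"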

Factor 2: User preference on the language category

Each language category may contain a list of accents. Take “English” as an example, there are 5 accents:

  1. British English (en-GB)
  2. American English (en-US)
  3. Australian English (en-AU)
  4. Irish English (en-IE)
  5. South African English (en-ZA)

AVSpeechSynthesisVoice(language: "en-AU") returns the user-selected voice track if that selection belongs to the “en-AU” category.

However, the user may have chosen an accent other than the requested language code “en-AU”, i.e. “en-US”, “en-GB”, “en-IE” or “en-ZA”. In this case, iOS returns the system default voice track of the “en-AU” accent, i.e. “Karen”, which is fixed.

Note: The enhanced version of the default voice track is returned if it exists.

Here is the list of system default voice tracks for all English accents:

  1. British English (en-GB) — Daniel
  2. American English (en-US) — Samantha
  3. Australian English (en-AU) — Karen
  4. Irish English (en-IE) — Moira
  5. South African English (en-ZA) — Tessa

AVSpeechUtterance parameter 3: Volume

The volume parameter is relative to the device volume and ranges from 0.0 to 1.0.

In iOS 12:

  • There are 16 volume levels in total
  • Each level corresponds to 6.25% of the maximum volume

iOS screen capture of volume page

Example:

  • The user has selected the 8th volume level (50% of the maximum)
  • The volume parameter is set to 0.5
  • Actual output volume:
    8 * 0.5 = 4th volume level (4 * 6.25% = 25% of the maximum volume)

This prevents an app from outputting speech louder than the user’s preferred device volume.

AVSpeechUtterance parameter 4: Rate

Rate controls how fast the speech is spoken. It ranges from 0.0 to 1.0, with a default value of 0.5 (AVSpeechUtteranceDefaultSpeechRate).

AVSpeechUtterance parameter 5: PitchMultiplier

PitchMultiplier controls how high-pitched each word sounds. It ranges from 0.5 to 2.0 with a default value of 1.0.

If it is set too low, a female voice track can sound like a male voice! This parameter should therefore be handled carefully, or simply left at its default.

AVSpeechUtterance parameter 6 & 7: preUtteranceDelay & postUtteranceDelay

PreUtteranceDelay is the time delay before the current AVSpeechUtterance starts, and postUtteranceDelay is the delay after it finishes.

Bear in mind that:

Total delay between two consecutive speeches
= postUtteranceDelay of the previous AVSpeechUtterance
+ preUtteranceDelay of the current AVSpeechUtterance
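Putting parameters 3 to 7 together, a configured utterance might look like the sketch below (the values are illustrative):

import AVFoundation

let utterance = AVSpeechUtterance(string: "Hello world")
utterance.volume = 0.8                               // 80% of the current device volume
utterance.rate = AVSpeechUtteranceDefaultSpeechRate  // 0.5; valid range 0.0 ... 1.0
utterance.pitchMultiplier = 1.2                      // valid range 0.5 ... 2.0, default 1.0
utterance.preUtteranceDelay = 0.5                    // seconds of silence before this utterance
utterance.postUtteranceDelay = 1.0                   // seconds of silence after this utterance

let synthesizer = AVSpeechSynthesizer()              // keep a strong reference while it speaks
synthesizer.speak(utterance)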

Speech operations

There are 4 speech operations provided in AVSpeechSynthesizer :

  1. Play
  2. Pause
  3. Continue
  4. Stop

Operation 1 — Play:

Playing a voice track is simple: just pass the AVSpeechUtterance object to speak() on the AVSpeechSynthesizer instance.

Note that calling speak() multiple times does not interrupt the currently playing speech. All newly added voice tracks are queued and spoken one by one.

If you want to override the currently playing voice track, it must be stopped with stop() first.

Operation 2 — Pause:

Two options are available for pausing a playing speech. They are:

  1. AVSpeechBoundary.immediate
    Speech is paused right at the current pronunciation.
  2. AVSpeechBoundary.word
    Speech is paused only after the word currently being spoken has been fully pronounced.

Example:

  • Original speech: “Medium”
  • Pronunciation: “Me” — “Di” — “Um”
  • AVSpeechBoundary.immediate can stop the speech at “Me”, “Di” or “Um”
  • AVSpeechBoundary.word can only stop the speech at “Um”

Operation 3 — Continue:

A speech can only be continued if it is paused but NOT stopped.

Operation 4 — Stop:

Similar to pausing, there are 2 options for stopping a playing speech. They are:

  1. AVSpeechBoundary.immediate
  2. AVSpeechBoundary.word

Note: Stopped speech can never be resumed by any means. Consider pausing a speech if possible.
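The sketch below shows all four operations with the real method names (speak(_:), pauseSpeaking(at:), continueSpeaking() and stopSpeaking(at:)); in a real app these calls would of course be triggered by user actions rather than run back to back:

import AVFoundation

let synthesizer = AVSpeechSynthesizer()

// 1. Play: enqueue an utterance.
synthesizer.speak(AVSpeechUtterance(string: "Medium is a nice place to read articles."))

// 2. Pause: immediately or at the next word boundary.
_ = synthesizer.pauseSpeaking(at: .word)

// 3. Continue: only valid for a paused (not stopped) speech.
if synthesizer.isPaused {
    _ = synthesizer.continueSpeaking()
}

// 4. Stop: clears the queue; a stopped speech can never be resumed.
_ = synthesizer.stopSpeaking(at: .immediate)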

AVSpeechSynthesizerDelegate

AVSpeechSynthesizerDelegate provides 6 callbacks that report the status of an AVSpeechSynthesizer while it outputs voice. They are:

  1. didStart
  2. didFinish
  3. didPause
  4. didContinue
  5. didCancel
  6. willSpeakRangeOfSpeechString

Case 1 — Play a speech to the end:

  1. Play a speech (didStart)
  2. Complete a speech (didFinish)

Case 2 — Play & Stop a speech:

  1. Play a speech (didStart)
  2. Stop a speech (didCancel)

Case 3 — Play & Resume a speech:

  1. Play a speech (didStart)
  2. Pause a speech (didPause)
  3. Resume a speech (didContinue)
  4. Until the end of speech (didFinish)
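A minimal delegate sketch that simply logs each callback (the SpeechLogger class name is illustrative):

import AVFoundation

// AVSpeechSynthesizerDelegate requires NSObject conformance.
final class SpeechLogger: NSObject, AVSpeechSynthesizerDelegate {

    func speechSynthesizer(_ synthesizer: AVSpeechSynthesizer, didStart utterance: AVSpeechUtterance) {
        print("didStart:", utterance.speechString)
    }

    func speechSynthesizer(_ synthesizer: AVSpeechSynthesizer, didFinish utterance: AVSpeechUtterance) {
        print("didFinish")
    }

    func speechSynthesizer(_ synthesizer: AVSpeechSynthesizer, didPause utterance: AVSpeechUtterance) {
        print("didPause")
    }

    func speechSynthesizer(_ synthesizer: AVSpeechSynthesizer, didContinue utterance: AVSpeechUtterance) {
        print("didContinue")
    }

    func speechSynthesizer(_ synthesizer: AVSpeechSynthesizer, didCancel utterance: AVSpeechUtterance) {
        print("didCancel")
    }

    func speechSynthesizer(_ synthesizer: AVSpeechSynthesizer,
                           willSpeakRangeOfSpeechString characterRange: NSRange,
                           utterance: AVSpeechUtterance) {
        // Handy for highlighting the word that is currently being spoken.
        print("willSpeak:", (utterance.speechString as NSString).substring(with: characterRange))
    }
}

// The delegate property is weak, so keep the logger alive (e.g. as a property):
// synthesizer.delegate = speechLogger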

Practical considerations of AVSpeechSynthesizer

Consideration 1: Mixed texts in English and other languages

If you are targeting the global market, your app may sometimes contain paragraphs mixing English with other languages. Take the string “Hong Kong (香港) is in Asia.” as an example:

All compact and enhanced voice tracks handle this case quite smoothly. However, due to the different accents of different voice tracks, you should not expect 100% fluent English pronunciation of every English word. There is a little trade-off.
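A sketch of the example above; the voice choice (here a Hong Kong Cantonese voice) is an assumption and any installed voice can be used:

import AVFoundation

let utterance = AVSpeechUtterance(string: "Hong Kong (香港) is in Asia.")
// A zh-HK voice reads both 香港 and the English words,
// but the English words carry the accent of the chosen voice.
utterance.voice = AVSpeechSynthesisVoice(language: "zh-HK")

let synthesizer = AVSpeechSynthesizer()
synthesizer.speak(utterance)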

Consideration 2: iOS interrupts AVSpeechSynthesizer at app lifecycle

iOS automatically stops the AVSpeechSynthesizer with a smooth volume fade-out when the app is sent to the background (for example, by pressing the Home button) and resumes it with a smooth fade-in when the app returns to the foreground. Developers can monitor these states in applicationDidEnterBackground and applicationDidBecomeActive in the AppDelegate class. This helps a lot in providing a satisfying user experience!

Consideration 3: Speech is not stopped even when UIViewController is dismissed or popped

iOS continues outputting speech even after the UIViewController is dismissed or popped, and a reference to the AVSpeechSynthesizer instance is kept alive while it speaks. This blocks the UIViewController from being de-initialised until the AVSpeechSynthesizer has completed the whole speech. See the sketch below:
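A sketch of a demo view controller that reproduces the log below (the long text is only there to keep the speech running after the screen is closed):

import UIKit
import AVFoundation

final class AVFoundationDemoViewController: UIViewController {

    private let synthesizer = AVSpeechSynthesizer()

    override func viewDidLoad() {
        super.viewDidLoad()
        let utterance = AVSpeechUtterance(string: "This is a very long speech that keeps playing after the screen is dismissed.")
        synthesizer.speak(utterance)
    }

    override func viewDidDisappear(_ animated: Bool) {
        super.viewDidDisappear(animated)
        print("viewDidDisappear")
    }

    deinit {
        print("AVFoundationDemoViewController is deinit.")
    }
}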

Log:

viewDidDisappear
...
(Wait until AVSpeechSynthesizer completes the whole speech)
...
AVFoundationDemoViewController is deinit.

Solution:
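One possible fix, sketched under the assumption that the speech should simply end with the screen, is to stop the synthesizer as soon as the view controller disappears:

override func viewDidDisappear(_ animated: Bool) {
    super.viewDidDisappear(animated)
    print("viewDidDisappear")
    // Stop any active speech; a stopped speech cannot be resumed,
    // which is acceptable here because the screen is going away anyway.
    _ = synthesizer.stopSpeaking(at: .immediate)
}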

Consideration 4: Interrupt with Music app

The Music app can output sound while it is in the background. However, it shares the same audio channel with other apps, which means the music track is stopped when another app wants to use the audio channel to output sound. The app in the foreground has a higher privilege on the audio channel. However, iOS does not resume the original music track when the other app is done with the audio channel. UX designers should take this side effect on user experience into account.

Consideration 5: Interpretation of emoji

As stated before, AVSpeechSynthesizer supports emoji, but Apple has its own interpretation of each emoji symbol. It does not match 100% with the documentation at The Unicode Consortium or Emojipedia.

For example, the common emoji 😀 is pronounced as “Grinning face with normal eyes” in iOS, but the documentation at both The Unicode Consortium and Emojipedia shows that it should be “Grinning Face” only.

Screen captured from The Unicode Consortium about emoji 😀
Screen captured from Emojipedia about emoji 😀

At this point, some of you might think of the “Character Viewer” on the Mac (Command + Control + Space) used for entering emoji symbols, which shows a name for each emoji. Would that name be Apple’s interpretation? The answer is “No”: the name of 😀 there is still “Grinning Face”.

After googling for a long time, I still cannot find any official document showing how Apple interprets each emoji symbol. If you find any related documentation, please share it in the comment section; it would help me a lot. Thank you very much.

Conclusion:

  1. AVSpeechSynthesizer can recognise plain text, emoji, prefix units and postfix units. It can also handle sentences mixing English with words from another language if an appropriate voice parameter is chosen.
  2. AVSpeechSynthesizer keeps a queue of AVSpeechUtterance instances and plays them one by one instead of playing them all at the same time.
  3. Make sure no other AVSpeechSynthesizer instance is outputting voice, otherwise the request to output voice is ignored with no error message.
  4. The voice parameter in AVSpeechUtterance varies with the inputted language code and the user’s selected accent for that language.
  5. The volume parameter ensures the app cannot output at a volume higher than the current device volume.
  6. A stopped speech can never be resumed.
  7. Custom logic for AVSpeechSynthesizer can be added in applicationDidEnterBackground and applicationDidBecomeActive in AppDelegate.swift.
  8. Stop any active AVSpeechSynthesizer in order to de-initialise the UIViewController successfully.
  9. The interpretation of emoji by the AVFoundation framework is still unclear and needs further investigation. (Please leave a comment if you find any useful information.)

Thank you for reading this article. Please follow me on Twitter @myrick_chow for more information. I hope you now have a better understanding of how to convert text to speech with AVSpeechSynthesizer! Let’s create a great app! 😊


Mobile Lead @REAL Messenger Inc. https://real.co Focus on Android & iOS Native programming.