Swift Text-To-Speech, As Deep As Possible
Have you ever thought about converting text to speech so that blind users can use your app? Apple provides the excellent VoiceOver feature in both iOS and macOS, which helps blind users by giving audio feedback about the object currently in focus. Luckily, Apple has also opened the Text-To-Speech feature to developers for implementing VoiceOver-like features. The API is AVSpeechSynthesizer, which belongs to the AVFoundation framework. This article will go through the details and pitfalls of working with this API.
Let’s get an overview of the whole article first:
AVSpeechSynthesizer
AVSpeechSynthesizer is:
- Part of the AVFoundation framework
- Available since iOS 7.0
It provides the speech-related functions:
- Play
- Pause
- Continue (Only for a paused speech)
- Stop
Important properties:
1. A queue of AVSpeechUtterance instances, each of which contains a set of speech parameters:
   - Text
   - Voice
   - Rate
   - Pitch multiplier
   - Volume (better named “relative volume”)
   - Pre-utterance delay
   - Post-utterance delay
2. An AVSpeechSynthesizerDelegate object with 6 speech callbacks:
   - didStart
   - didFinish
   - didPause
   - didContinue
   - didCancel
   - willSpeakRangeOfSpeechString
AVSpeechUtterance
AVSpeechUtterance
stores the 7 speech parameters listed in the section above. It is ultimately passed to an AVSpeechSynthesizer to output speech. See the code below:
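The original code sample isn't preserved here; a minimal sketch of configuring all 7 parameters and speaking might look like this:

```swift
import AVFoundation

// Keep a strong reference; a locally-scoped synthesizer could be
// deallocated before it finishes speaking.
let synthesizer = AVSpeechSynthesizer()

let utterance = AVSpeechUtterance(string: "Hello world")
utterance.voice = AVSpeechSynthesisVoice(language: "en-US")
utterance.rate = AVSpeechUtteranceDefaultSpeechRate // 0.5
utterance.pitchMultiplier = 1.0
utterance.volume = 1.0                              // relative to device volume
utterance.preUtteranceDelay = 0.0
utterance.postUtteranceDelay = 0.0

synthesizer.speak(utterance)
```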
Note 1:
When multiple AVSpeechUtterance instances are passed to an AVSpeechSynthesizer, they are not spoken at the same time; they are queued and spoken one by one.
Note 2:
The audio channel in iOS allows only one source at a time. When one AVSpeechSynthesizer instance is outputting voice, no other AVSpeechSynthesizer instance can interrupt it. speak() requests from other AVSpeechSynthesizer instances are silently ignored, with no runtime error or warning. Developers should bear this in mind.
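One practical way to avoid silently dropped requests is to route all speech through a single shared synthesizer. A minimal sketch (the SpeechManager class is illustrative, not from the original article):

```swift
import AVFoundation

// Sharing one synthesizer app-wide sidesteps the "silently ignored
// speak()" problem, because every request goes through the same queue.
final class SpeechManager {
    static let shared = SpeechManager()
    private let synthesizer = AVSpeechSynthesizer()

    func speak(_ text: String) {
        synthesizer.speak(AVSpeechUtterance(string: text))
    }
}

// Usage: SpeechManager.shared.speak("Hello world")
```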
AVSpeechUtterance parameter 1: Text
The AVFoundation
framework is quite powerful and can recognise many different text formats. See below for sample texts and their corresponding pronunciations:
- Plain text — “Hello world” : “Hello world”
- Emoji — “❤️” : “Red heart”
* The pronunciation is supposed to match the name stated by The Unicode Consortium or Emojipedia. However, NOT EVERY emoji pronunciation matches the official name. For example, “😀” is pronounced as “Grinning Face With Normal Eyes” in iOS but named “Grinning Face” in Emojipedia. This is discussed later in this article.
- Prefix unit (dollar) — “$100.99” : “One hundred dollars and ninety-nine cents”
- Postfix unit (cm) — “10cm” : “Ten centimeters”
- Unrecognised words — “hdbjfcng” : “H-D-B-J-F-C-N-G” (spelled out letter by letter)
This greatly reduces the developer’s burden of parsing special characters or phrases into understandable speech.
AVSpeechUtterance parameter 2: Voice
There are many languages in the world, and each has accent variations across countries. Take English (en) as an example: iOS supports a total of 5 English accents:
- British English (en-GB)
- American English (en-US)
- Australian English (en-AU)
- Irish English (en-IE)
- South African English (en-ZA)
Apple takes care of iOS users by providing 52 default voice tracks, all in “Compact” mode. Users can manually upgrade the voice tracks to “Enhanced” mode for better voice quality. See the image below for the setting (red) and upgrading (green) procedures:
There are two ways to initialise an AVSpeechSynthesisVoice
object for the voice parameter of AVSpeechUtterance
:
1. By voice identifier (self-explanatory)
The list of available voice identifiers can be retrieved by AVSpeechSynthesisVoice.speechVoices(). See the code below:
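The original listing isn't preserved here; a sketch of what it might look like (the Samantha identifier is a common built-in one, but identifiers can vary between devices and iOS versions):

```swift
import AVFoundation

// Print every installed voice with its identifier, language and quality.
for voice in AVSpeechSynthesisVoice.speechVoices() {
    let quality = voice.quality == .enhanced ? "Enhanced" : "Compact"
    print(voice.identifier, voice.language, quality)
}

// Initialise a voice by identifier, e.g. the US English "Samantha" track.
let voice = AVSpeechSynthesisVoice(identifier: "com.apple.ttsbundle.Samantha-compact")
```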
2. By voice language code
Language code can be in either short form or long form.
For example, English:
- Short form: “en”
- Long form: “en-AU”, “en-GB”, “en-IE”, “en-US”, “en-ZA”
The initialised AVSpeechSynthesisVoice varies with 2 factors:
- The type of language code
- Whether the user has set a preference for that specific accent
Factor 1 — Short form / Long form language code:
For the long form “en-AU”, it is straightforward: it points to the Australian English accent.
However, “en” represents the broad category “English” and is thus ambiguous. Other examples are “es” (Spanish), “fr” (French), “nl” (Dutch), “pt” (Portuguese) and “zh” (Chinese). Which specific accent do they refer to? It depends on the mapping defined by Apple:
- “en” refers to “en-US”
- “es” refers to “es-ES”
- “fr” refers to “fr-FR”
- “nl” refers to “nl-NL”
- “pt” refers to “pt-PT”
- “zh” refers to “zh-CN”
Factor 2: User preference on the language category
Each language category may contain a list of accents. Take “English” as an example; there are 5 accents:
- British English (en-GB)
- American English (en-US)
- Australian English (en-AU)
- Irish English (en-IE)
- South African English (en-ZA)
AVSpeechSynthesisVoice(language: "en-AU") returns the user-selected voice track if that selection is in the “en-AU” category.
However, the user may have chosen an accent other than the inputted language code “en-AU”, i.e. “en-US”, “en-GB”, “en-IE” or “en-ZA”. In that case, iOS returns the system default voice track of the “en-AU” accent, i.e. “Karen”, which is fixed.
Note: The enhanced version of the default voice track is returned if it exists.
Here is a list of system default voice track of all English accents:
- British English (en-GB) — Daniel
- American English (en-US) — Samantha
- Australian English (en-AU) — Karen
- Irish English (en-IE) — Moira
- South African English (en-ZA) — Tessa
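A sketch of initialising voices by language code, illustrating the two factors above:

```swift
import AVFoundation

// Long form: resolves to the user's preferred "en-AU" voice if one is
// selected, otherwise the system default for that accent ("Karen").
let australian = AVSpeechSynthesisVoice(language: "en-AU")

// Short form: "en" is ambiguous, so Apple maps it to "en-US".
let english = AVSpeechSynthesisVoice(language: "en")

let utterance = AVSpeechUtterance(string: "Hello world")
utterance.voice = australian
```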
AVSpeechUtterance parameter 3: Volume
The volume parameter is relative to the device volume and ranges from 0.0 to 1.0.
In iOS 12:
- There are 16 volume levels in total
- Each level contributes 6.25% of the max volume
Example:
- The user selects the 8th volume level
- The volume parameter is set to 0.5
- Actual output volume: 8 * 0.5 = 4th volume level (4 * 6.25% = 25% of max volume)
This prevents an iOS app from outputting at a volume greater than the user’s preferred device volume.
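Using the iOS 12 numbers above, the effective output volume can be sketched as plain arithmetic (the function name is illustrative):

```swift
// Effective volume under the iOS 12 model described above:
// device level (1...16) scaled by the utterance's relative volume.
func effectiveVolumePercent(deviceLevel: Int, relativeVolume: Float) -> Float {
    let levelContribution: Float = 6.25  // each level = 6.25% of max volume
    return Float(deviceLevel) * relativeVolume * levelContribution
}

// 8th level with volume = 0.5 → the 4th level, i.e. 25% of max volume.
print(effectiveVolumePercent(deviceLevel: 8, relativeVolume: 0.5)) // 25.0
```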
AVSpeechUtterance parameter 4: Rate
Rate controls how fast the speech is output. It ranges from 0.0 to 1.0 with a default value of 0.5.
AVSpeechUtterance parameter 5: PitchMultiplier
PitchMultiplier controls how sharp each word is pronounced. It ranges from 0.5 to 2.0 with a default value of 1.0.
If it is set too low, a female voice track can sound like a male voice! This parameter should therefore be handled carefully, or simply left at its default.
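A minimal sketch of setting both parameters, using the framework's own rate constants rather than raw literals:

```swift
import AVFoundation

let utterance = AVSpeechUtterance(string: "Hello world")

// Rate: AVSpeechUtteranceMinimumSpeechRate (0.0) up to
// AVSpeechUtteranceMaximumSpeechRate (1.0); the default is 0.5.
utterance.rate = AVSpeechUtteranceDefaultSpeechRate

// Pitch: 0.5...2.0, default 1.0. Values far below 1.0 can make a
// female voice track sound male, so keep changes subtle.
utterance.pitchMultiplier = 1.1
```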
AVSpeechUtterance parameter 6 & 7: preUtteranceDelay & postUtteranceDelay
PreUtteranceDelay is the time delay before the current AVSpeechUtterance starts, and postUtteranceDelay is the delay after it finishes.
Bear in mind that:
Total delay between two consecutive speeches = postUtteranceDelay of the previous AVSpeechUtterance + preUtteranceDelay of the current AVSpeechUtterance.
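A sketch of the delay arithmetic between two queued utterances:

```swift
import AVFoundation

let first = AVSpeechUtterance(string: "First sentence.")
first.postUtteranceDelay = 0.5   // pause after this utterance finishes

let second = AVSpeechUtterance(string: "Second sentence.")
second.preUtteranceDelay = 0.3   // pause before this utterance starts

// Total silence between the two speeches: 0.5 + 0.3 = 0.8 seconds.
let synthesizer = AVSpeechSynthesizer()
synthesizer.speak(first)
synthesizer.speak(second)
```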
Speech operations
There are 4 speech operations provided in AVSpeechSynthesizer
:
- Play
- Pause
- Continue
- Stop
Operation 1 — Play:
Playing a voice track is simple. Just pass the AVSpeechUtterance object to speak() on an AVSpeechSynthesizer instance.
Note that calling speak() multiple times does not interrupt the currently playing speech. All newly added voice tracks are queued and output one by one.
If you want to override the currently playing voice track, it must first be stopped with stopSpeaking(at:).
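For example, a sketch of queueing and then overriding the queue:

```swift
import AVFoundation

let synthesizer = AVSpeechSynthesizer()
synthesizer.speak(AVSpeechUtterance(string: "First"))
synthesizer.speak(AVSpeechUtterance(string: "Second")) // queued, not interrupting

// To override everything queued, stop first, then speak the new utterance.
synthesizer.stopSpeaking(at: .immediate)
synthesizer.speak(AVSpeechUtterance(string: "Urgent announcement"))
```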
Operation 2 — Pause:
2 options are available for pausing a playing speech:
- AVSpeechBoundary.immediate — the speech is paused right at the current pronunciation.
- AVSpeechBoundary.word — the speech is paused after the last pronunciation of the current word.
Example:
- Original speech: “Medium”
- Pronunciation: “Me” — “di” — “um”
- AVSpeechBoundary.immediate can pause the speech at “Me”, “di” or “um”
- AVSpeechBoundary.word can only pause the speech after “um”
Operation 3 — Continue:
A speech can only be continued if it has been paused, NOT stopped.
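A sketch of pausing and then resuming:

```swift
import AVFoundation

let synthesizer = AVSpeechSynthesizer()
synthesizer.speak(AVSpeechUtterance(string: "Medium is a publishing platform."))

// Pause at the end of the current word...
synthesizer.pauseSpeaking(at: .word)

// ...and later resume exactly where it left off. continueSpeaking()
// only works on a paused speech, never a stopped one.
if synthesizer.isPaused {
    synthesizer.continueSpeaking()
}
```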
Operation 4 — Stop:
Similar to pausing, there are 2 options for stopping a playing speech. They are:
- AVSpeechBoundary.immediate
- AVSpeechBoundary.word
Note: Stopped speech can never be resumed by any means. Consider pausing a speech if possible.
AVSpeechSynthesizerDelegate
AVSpeechSynthesizerDelegate provides a total of 6 callbacks reflecting the status of AVSpeechSynthesizer while it outputs voice. They are:
- didStart
- didFinish
- didPause
- didContinue
- didCancel
- willSpeakRangeOfSpeechString
Case 1 — Play a speech to the end:
- Play a speech (didStart)
- Complete a speech (didFinish)
Case 2 — Play & Stop a speech:
- Play a speech (didStart)
- Stop a speech (didCancel)
Case 3 — Play & Resume a speech:
- Play a speech (didStart)
- Pause a speech (didPause)
- Resume a speech (didContinue)
- Until the end of speech (didFinish)
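The cases above can be observed with a delegate. A minimal sketch showing three of the six callbacks (SpeechObserver is an illustrative name; the delegate property is weak, so keep a strong reference to the observer):

```swift
import AVFoundation

final class SpeechObserver: NSObject, AVSpeechSynthesizerDelegate {

    func speechSynthesizer(_ synthesizer: AVSpeechSynthesizer,
                           didStart utterance: AVSpeechUtterance) {
        print("didStart")
    }

    func speechSynthesizer(_ synthesizer: AVSpeechSynthesizer,
                           didFinish utterance: AVSpeechUtterance) {
        print("didFinish")
    }

    func speechSynthesizer(_ synthesizer: AVSpeechSynthesizer,
                           willSpeakRangeOfSpeechString characterRange: NSRange,
                           utterance: AVSpeechUtterance) {
        // Useful for highlighting the word currently being spoken.
        let text = utterance.speechString as NSString
        print("Speaking:", text.substring(with: characterRange))
    }
}

let synthesizer = AVSpeechSynthesizer()
let observer = SpeechObserver()
synthesizer.delegate = observer
```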
Practical considerations of AVSpeechSynthesizer
Consideration 1: Mixed texts in English and other languages
If you are targeting your app at the global market, it may sometimes contain paragraphs mixing English text with other languages. Take the string “Hong Kong (香港) is in Asia.” as an example:
All compact and enhanced voice tracks handle this case quite smoothly. However, due to the different accents across voice tracks, you should not expect 100% fluent English pronunciation of every English word. There is a little trade-off.
Consideration 2: iOS interrupts AVSpeechSynthesizer at app lifecycle
iOS automatically stops the AVSpeechSynthesizer with a smooth volume fade-out when the app is sent to the background (for example, by pressing the Home button) and resumes it with a smooth fade-in when the app is brought to the foreground again. Developers can monitor these states in applicationDidEnterBackground and applicationDidBecomeActive in the AppDelegate class. This helps a lot in providing a satisfying user experience!
Consideration 3: Speech is not stopped even when UIViewController is dismissed or popped
iOS continues outputting speech even after the UIViewController is dismissed or popped, and the ongoing speech keeps the AVSpeechSynthesizer instance alive. This blocks the UIViewController from de-initialising until the AVSpeechSynthesizer has completed the whole speech. See the code below:
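The original code sample isn't preserved here; a minimal reconstruction that would produce the log below might look like this (class body illustrative):

```swift
import AVFoundation
import UIKit

final class AVFoundationDemoViewController: UIViewController {
    private let synthesizer = AVSpeechSynthesizer()

    override func viewDidLoad() {
        super.viewDidLoad()
        synthesizer.speak(AVSpeechUtterance(string: "A very long speech..."))
    }

    override func viewDidDisappear(_ animated: Bool) {
        super.viewDidDisappear(animated)
        print("viewDidDisappear")
        // Speech keeps playing; deinit waits until it finishes.
    }

    deinit {
        print("AVFoundationDemoViewController is deinit.")
    }
}
```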
Log:
viewDidDisappear
...
(Wait until AVSpeechSynthesizer completes the whole speech)
...
AVFoundationDemoViewController is deinit.
Solution:
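The original solution snippet isn't preserved here; a minimal approach is to stop the synthesizer when the view disappears (assuming a synthesizer property as in the demo controller above):

```swift
override func viewDidDisappear(_ animated: Bool) {
    super.viewDidDisappear(animated)
    // Stop the speech when leaving the screen so the view controller
    // can be de-initialised immediately.
    synthesizer.stopSpeaking(at: .immediate)
}
```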
Consideration 4: Interrupt with Music app
The Music app can output sound while in the background. However, it shares the same audio channel with other apps, which means the music track is stopped when another app wants to use the audio channel to output sound. The foreground app has higher privilege on the audio channel. However, iOS does not resume the original music track when the other app is done with the audio channel. UX designers should take care of this side effect on the user experience.
Consideration 5: Interpretation of emoji
As stated before, AVSpeechSynthesizer
supports emoji, but Apple has its own interpretation of each emoji symbol. It does not 100% match the documentation from The Unicode Consortium or Emojipedia.
For example, the common emoji 😀 is pronounced as “Grinning face with normal eyes” in iOS, but the documentation at both The Unicode Consortium and Emojipedia shows that it should be “Grinning Face” only.
At this point, some of you might think of the “Character Viewer” (Command + Control + Space) on Mac for entering emoji symbols, which shows a name for each emoji. Could that be Apple’s interpretation? The answer is “No”. The name of 😀 there is still “Grinning Face”.
After googling for a long time, I still cannot find any official document showing how Apple interprets each emoji symbol. If you find any related documentation, please share it in the comments section; it would help me a lot. Thank you very much.
Conclusion:
- AVSpeechSynthesizer can recognise plain text, emoji, prefix units and postfix units. It can also handle sentences mixing English with words from other languages if the correct voice parameter is chosen.
- AVSpeechSynthesizer has a queue of AVSpeechUtterance instances and plays them one by one instead of all at once.
- It is necessary to confirm that no other AVSpeechSynthesizer instance is outputting voice; otherwise the request to output voice is ignored with no error message.
- The voice parameter in AVSpeechUtterance varies with the inputted language code and the user’s selected accent for that specific language.
- The volume parameter ensures the app cannot output at a volume higher than the current device volume.
- Stopped speech can never be resumed.
- Custom logic for AVSpeechSynthesizer can be added in applicationDidEnterBackground and applicationDidBecomeActive in AppDelegate.swift.
- It is necessary to stop any active AVSpeechSynthesizer in order to de-initialise the UIViewController successfully.
- How the AVFoundation framework interprets emoji is still unclear and needs further investigation. (Please leave a comment if you find any useful information.)
Thank you for reading this article. Please follow me on Twitter @myrick_chow for more information. I hope you now have a better understanding of how to convert text to speech with AVSpeechSynthesizer! Let’s create a great app! 😊