Introduction
This article shows how to create a WPF application that makes good use of the Speech APIs in the .NET Framework to generate spoken responses for the written messages that we provide. As a bonus, I will also show how to change the voice and the rate of speech.
Today was the last day of my Bachelor of Computer Science program. Yesterday I was thinking that I should try my luck in literature or something like that, but there was only one thing on my mind: how will I read all those textbooks? All of a sudden I thought, why not create software that reads aloud the text I enter, so that I can enjoy someone else reading it for me? Thus, I created the app, and I want to share it with others.
The application source code contains the assemblies and other tools required to build this very intuitive application that reads the message passed to it. The source code shows how to change the speaking rate and the voice of the audio, and how to stop the audio playback.
Figure 1: Application interface containing a sample text in the TextBox, default rate (zero), first installed voice selected and three buttons available for each of three functions.
Requirements
The requirements to read the article are an internet connection and a web browser. To use the application, however, you need to build it first, as I have removed all of the binaries and object files so that you can build the source code on your own platform. I have the following environment:
- Microsoft Visual Studio 2013 (Ultimate edition)
- .NET Framework 4.5
You can certainly try the application's source code in your own IDE and environment. At worst it just won't compile, nothing big!
Getting started
First of all, we need to understand what the application will do. It is a simple input/output program: the input is a text string, and the output is speech, the spoken form of the message that we provided to the application.
Getting the text
The input is just the message, or an entire essay, that we want to listen to. It is string data that we enter ourselves, so we use a TextBox control to hold the content of the message to be read. A simple TextBox control is enough; if we want, we can add other attributes to make it a better fit for our application. The following XAML markup is enough to generate the text box that gets the input from the user.
- <TextBox Name="text" Height="200"
- Text="Hello there, enter some text and I would read it for you!"
- TextWrapping="Wrap"></TextBox>
This input part is quite easy and short. The most time-consuming part is the speech part, along with the events that begin or stop the speaking or change the output of the application. In this article, I will show you the following two types of output:
- Speaking the output through the default audio device. In most cases that is the speakers, a headset or another device that you have configured in Control Panel. Good for listening to the text right away.
- Saving the output as audio in a Waveform file (.wav). Good for sharing the text-to-speech file over the network or for playing it later.
Keep reading; the source code is meant to be intuitive, so that you can understand the process by reading it.
Generating the Speech
First of all, let us talk about the input section. The input can be anything, but since we are using the System.Speech namespace, we will try to stay as close as we can to the approach that the namespace itself favors. Also, since we will not be recognizing any speech input and our input is only a plain string message, we only need to include the System.Speech.Synthesis namespace, which holds the objects required to speak an output to the user.
Now that we have an idea of our context, the namespace (System.Speech.Synthesis) and the application development framework (Windows Presentation Foundation), we can continue to the input and output section.
One thing you should understand before continuing with this article is that we will use only one object from the namespace to create the entire application: SpeechSynthesizer. This class implements the IDisposable interface, which lets us call the Dispose function on it as soon as we are done working with it. In other words, we can use the using block along with this object, as in the following code snippet.
- using (var reader = new SpeechSynthesizer()) {
-
- }
But do not write it this way. We will use the object in Windows Presentation Foundation (WPF), which uses a single thread to execute the business logic and to update the user interface. Suppose you write your application in the most resource-efficient way, as in the following code snippet:
- using (var reader = new SpeechSynthesizer()) {
-
- reader.Speak(message);
- }
The preceding code takes care of the resources itself, clearing them out as soon as they are no longer needed, and it also speaks the message. The application does its work as expected. But the application freezes. In most scenarios a WPF application freezes because a function is still running on the UI thread and has not yet returned from the event handler. Button event handlers, network resource access, long loops and similar processing can all cause this. Here, the Speak function blocks the application, speaks the message, and only then returns control to the thread so that it can update the user interface.
What if I use SpeakAsync instead of Speak?
SpeechSynthesizer exposes two functions, Speak and SpeakAsync, that can be used to speak the message that we have passed. If you use SpeakAsync instead of Speak in the preceding code sample, you will not hear anything. This is because SpeakAsync returns to its caller as soon as the speech has been queued, instead of waiting for the speech to finish. Execution continues with the next statements, reaches the end of the using block and calls the Dispose function on the reader object. The synthesizer is disposed before it has spoken anything, so nothing is heard.
Remember: SpeakAsync cannot be awaited
That leaves you to create your own functions that let the application speak asynchronously while the user can still access the buttons and other functions. Continue reading; by the end of the article we will have back-end code that is asynchronous (that is, it does not freeze the UI thread) and also accessible, so that the audio can be stopped, the output can be changed and so on.
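Although SpeakAsync cannot be awaited directly, one possible way to get an awaitable call is to wrap it in a Task that completes when the SpeakCompleted event fires. The following is only a minimal sketch under that assumption and is not part of the application's source code; the SpeechSynthesizerExtensions class and the SpeakTaskAsync method are hypothetical names introduced here for illustration.

```csharp
using System;
using System.Speech.Synthesis;
using System.Threading.Tasks;

public static class SpeechSynthesizerExtensions
{
    // Hypothetical helper (not part of System.Speech): wraps SpeakAsync in a Task
    // so that callers can await the end of the speech.
    public static Task SpeakTaskAsync(this SpeechSynthesizer reader, string message)
    {
        var completion = new TaskCompletionSource<bool>();

        // Create the Prompt up front so that the handler can recognize its own completion.
        var prompt = new Prompt(message);
        EventHandler<SpeakCompletedEventArgs> handler = null;

        handler = (sender, e) =>
        {
            // Ignore completions that belong to other prompts.
            if (!ReferenceEquals(e.Prompt, prompt))
            {
                return;
            }

            // Unsubscribe so that the handler runs only once per call.
            reader.SpeakCompleted -= handler;

            if (e.Error != null)
            {
                completion.TrySetException(e.Error);
            }
            else if (e.Cancelled)
            {
                completion.TrySetCanceled();
            }
            else
            {
                completion.TrySetResult(true);
            }
        };

        reader.SpeakCompleted += handler;
        reader.SpeakAsync(prompt);
        return completion.Task;
    }
}
```

With a helper like this, an async event handler could write await reader.SpeakTaskAsync(text.Text); and the UI thread would stay responsive. Even a using block would then dispose the synthesizer only after the awaited speech has finished.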
For a complete overview of how SpeakAsync works, have a look at the following image.
Figure 2: Shows how a user can write fully efficient and memory-friendly code to speak a general text but still get into trouble.
Figure 2 explains how and why there is no output.
One more thing: Prompt vs. String
SpeechSynthesizer allows us to use string values or Prompt objects to generate a speech response for our users. We can use either of them and the response is the same.
String is a data type in the .NET Framework that every developer understands. If you use a string, such as a plain literal constant, you can generate the speech in the same way as you would by using Prompt objects.
Prompt, on the other hand, is a class in System.Speech.Synthesis.
- string message = "Hello, world";
- Prompt prompt = new Prompt("Hello, world");
-
- reader.Speak(message);
- reader.Speak(prompt);
Both of the preceding calls generate the same output, so where is the difference? The difference is that a string is just plain text that is spoken. A Prompt, on the other hand, is an object that can be generated using a PromptBuilder object and can contain definitions for paragraphs, pre-recorded audio files, and changes to the voice and/or the rate at which the speech is rendered and spoken.
If you want to create an application that, for example, reads out a dialogue between two people, you should use the PromptBuilder (instead of creating a Prompt each time). For example, see the sample from this link:
“Now he is here,” I exclaimed. “For Heaven's sake, hurry down! Do be quick; and stay among the trees until he is fairly in.”
“I must go, Cathy,” said Heathcliff, seeking to extricate himself from his companion's arms. “I won't stray five yards from your window.”
“For one hour,” he pleaded earnestly.
“Not for one minute,” she replied.
“I must–Linton will be up immediately,” persisted the intruder.
In this case, you would require many prompts, or a single PromptBuilder (along with the definitions of paragraphs, audio samples, voices and other say-as details) passed to a Prompt constructor, which then creates a new Prompt object for our rendering purposes.
Passing the preceding passage (or dialogue) directly as a string would not be a good idea, and neither would changing the output type or the voice many times by hand be an efficient solution. In such contexts, passing a Prompt is the efficient way. If, however, you are only going to read plain text, like an essay or a paragraph, then a string will serve you well enough.
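To make the dialogue case more concrete, here is a minimal sketch of how a PromptBuilder could give the two speakers different voices. It is an illustration only, not code from the application: the DialogueDemo class and the BuildDialogue method are hypothetical names, and which voice is actually used for VoiceGender.Male or VoiceGender.Female depends on the voices installed on the machine.

```csharp
using System.Speech.Synthesis;

public static class DialogueDemo
{
    // Hypothetical helper: builds a two-speaker dialogue with one PromptBuilder.
    public static PromptBuilder BuildDialogue()
    {
        var builder = new PromptBuilder();

        // First speaker: request a male voice for the quoted words only.
        builder.StartVoice(VoiceGender.Male);
        builder.AppendText("For one hour,");
        builder.EndVoice();

        // Narration is spoken in the synthesizer's current voice.
        builder.AppendText(" he pleaded earnestly.");
        builder.AppendBreak();

        // Second speaker: request a female voice.
        builder.StartVoice(VoiceGender.Female);
        builder.AppendText("Not for one minute,");
        builder.EndVoice();

        builder.AppendText(" she replied.");
        return builder;
    }
}
```

The builder can then be wrapped in a Prompt and spoken, for example with reader.SpeakAsync(new Prompt(DialogueDemo.BuildDialogue()));. PromptBuilder also offers StartParagraph, StartSentence and AppendAudio for the paragraph and pre-recorded audio definitions mentioned above.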
Building the application
In the preceding sections, I have explained the application's background to make it a little easier for you to understand. Now it is time to use the objects and build an application that can generate audio output for our text input.
Creating the Window
In the WPF framework, you create windows or pages to render the controls for your user interface. We need the following controls:
- TextBox control: to get the input text from the user.
- Slider control: to get the speaking rate of the speech. (Range from -10 to 10)
- ComboBox control: to get the voice for speaking. (We would bind it to the currently installed voices.)
- Button controls: to trigger various functions. Here are three in our application:
- Read: for reading the text aloud.
- Stop: for stopping the reading process.
- Save: for saving the output in a Waveform format file.
The XAML markup in my application is:
- <Window x:Class="ApplicationToRead.MainWindow"
- xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"
- xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml"
- Title="Read out for me" Height="380" Width="525">
- <Grid Margin="10">
- <StackPanel>
- <TextBlock FontSize="23" HorizontalAlignment="Center" Margin="0, 0, 0, 10">
- App that reads out for you
- </TextBlock>
- <TextBox Name="text" Height="200" Text="Hello there, enter some text and I would read it for you!" TextWrapping="Wrap"></TextBox>
- <TextBlock FontStyle="Italic">Reading rate</TextBlock>
- <Slider Minimum="-10" Maximum="10" Margin="80, -15, 0, 0"
- Ticks="2" HorizontalAlignment="Left" Width="400"
- TickFrequency="5" TickPlacement="BottomRight"
- Name="slider"
- ></Slider>
- <TextBlock FontStyle="Italic">Select voice</TextBlock>
- <ComboBox Margin="80, -18, 0, 0" Name="comboBox" ItemsSource="{Binding}"></ComboBox>
- <Grid Margin="10">
- <Grid.ColumnDefinitions>
- <ColumnDefinition />
- <ColumnDefinition />
- <ColumnDefinition />
- </Grid.ColumnDefinitions>
- <Button Width="80" Name="read" Click="read_Click">Read</Button>
- <Button Width="80" Name="stop" Click="stop_Click" Grid.Column="1">Stop</Button>
- <Button Width="80" Name="save" Click="save_Click" Grid.Column="2">Save</Button>
- </Grid>
- </StackPanel>
- </Grid>
- </Window>
The GUI has already been shared in the Introduction section of this article. You can view it there.
Back-end code
Now let us write the back-end code so that our application can actually do something useful for us. We need a few members to hold the state of the application.
- A variable to store the state of the application, whether the reader is reading or not.
- private bool reading { get; set; }
- We also need to stop the speech when we want to. So, we create a private handle-like variable for the currently spoken prompt.
- private Prompt activePrompt { get; set; }
- We also need the SpeechSynthesizer object.
- private SpeechSynthesizer reader { get; set; }
As already said, we need the object throughout the application's lifetime because we need it to render the response and provide us with audio to be heard. If we instead create the object every time, we are left with the following two scenarios.
- In the first scenario, we use the Speak method (not SpeakAsync). It ensures that the text is spoken fully before anything else is done, but it leads to the problem of our application freezing until the entire text has been spoken. This is a bad way to write an application.
- In the second, we create a new object (as in Figure 2 for SpeakAsync) and use it to speak the text asynchronously. That solves the problem of the application freezing, but it leads to another problem: the user doesn't hear a thing. The reason is explained in Figure 2 above.
So, we need an object that can be accessed from various functions, and a private member is a suitable candidate. Also remember that having fewer variables is memory-efficient, but a program that creates many objects and discards them after one or two statements is not a good program either, because managing that memory costs CPU time. The CPU is also a resource: using less RAM and less CPU is a good solution, while saving memory at the cost of wasting CPU is a bad pattern to follow. Manage them together to create a good application.
Note that we can always call the Dispose function on the object ourselves, so we do not necessarily need the using block. The using block is just a shorthand that lets us forget about releasing the resources and focus on how to use the object. So remove the using block and create a private object that can be accessed from various functions and is disposed when the application is closing.
-
- this.Closing += (sender, e) =>
- {
- reader.Dispose();
- };
The preceding code attaches a lambda expression as the event handler for the Closing event of WPF's Window object. The handler calls the Dispose function, so that the object is disposed of when it is no longer needed. This should be a "Buddy, please!" for the memory-efficiency freaks.
Now we need the functions (as the event handlers for the Button controls) to do what we want them to and a bit more tinkering to make our application work.
Selecting installed voices
First, we need to list the voices that are currently available. Note that voices are installed as software, as a library or a utility. You can only use a voice that has been installed, not the ones you expect or want to hear. The following code collects the installed voices.
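- // Collect the names of the voices that can be used
- List<string> names = new List<string>();
-
- foreach (var voice in reader.GetInstalledVoices())
- {
-     // Skip voices that are installed but not enabled
-     if (voice.Enabled)
-     {
-         names.Add(voice.VoiceInfo.Name);
-     }
- }
-
- // Show the names in the ComboBox and select the first one
- comboBox.DataContext = names;
- comboBox.SelectedIndex = 0;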
Note that the InstalledVoice object has an Enabled property that tells you whether a voice is enabled (ready for use) or not. If a voice is not enabled, it cannot be used. That is why I have a condition to load only the enabled voices. In my case, all of the installed voices had the Enabled flag set to true. The list is then bound to the comboBox we have, so that our ComboBox now displays the names of the installed voices.
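If you want to see more than the name of each voice, the VoiceInfo object also exposes details such as Gender and Culture. The sketch below is an illustration only; the VoiceCatalog class and the DescribeEnabledVoices method are hypothetical names that are not part of the application. It builds a description line for every enabled voice.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Speech.Synthesis;

public static class VoiceCatalog
{
    // Hypothetical helper: returns one description per enabled voice,
    // in the form "Name (Gender, culture name)".
    public static List<string> DescribeEnabledVoices(SpeechSynthesizer reader)
    {
        return reader.GetInstalledVoices()
            .Where(voice => voice.Enabled)
            .Select(voice => string.Format(
                "{0} ({1}, {2})",
                voice.VoiceInfo.Name,
                voice.VoiceInfo.Gender,
                voice.VoiceInfo.Culture.Name))
            .ToList();
    }
}
```

These descriptions are meant for display or diagnostics only. SelectVoice still expects the plain VoiceInfo.Name value, so the ComboBox binding shown above should keep using the names.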
In the first image, you will see Microsoft David Desktop. That is an installed voice (I did not install it; perhaps .NET or the Microsoft.Speech library did, I am not sure), along with two others, Microsoft Hazel Desktop and Microsoft Zira Desktop. Microsoft David Desktop is selected automatically because of our code; see the last line of the preceding code block.
Reading out the text
In this section, I will show you the code that generates the speech response that reads out the text to the user.
See the following code block and read the comments added:
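- private void read_Click(object sender, RoutedEventArgs e)
- {
-     // Do not start a second reading while one is still in progress
-     if (reading)
-     {
-         MessageBox.Show("Previous reader is currently reading. Press 'Stop' to try stopping it and try again.");
-         return;
-     }
-
-     // Get the text to read
-     string message = text.Text;
-
-     // Apply the speaking rate chosen on the slider (-10 to 10)
-     reader.Rate = (int)slider.Value;
-
-     // Select the chosen voice; SelectVoice throws for an unknown or empty name
-     if (comboBox.SelectedIndex != -1)
-     {
-         reader.SelectVoice(comboBox.SelectedItem.ToString());
-     }
-
-     // Send the audio to the default device (Save may have redirected it to a file)
-     reader.SetOutputToDefaultAudioDevice();
-
-     // Reset the flag once the speech completes or is cancelled, then unsubscribe
-     EventHandler<SpeakCompletedEventArgs> onCompleted = null;
-     onCompleted = (s, ev) =>
-     {
-         reading = false;
-         reader.SpeakCompleted -= onCompleted;
-     };
-     reader.SpeakCompleted += onCompleted;
-
-     // Start speaking and keep the Prompt so that Stop can cancel it
-     reading = true;
-     activePrompt = reader.SpeakAsync(message);
- }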
Thus, when the user presses the Read button, the handler sets up the speaking configuration and then changes the output to the default audio device, which can be changed using Control Panel. In most cases the default device is the speakers, otherwise headphones (if attached) or another similar device.
I did say that in this article I would show you how to generate speech to hear through the speakers (or the default device) and how to generate audio in a Waveform file to share over the network, to listen to later or for any other purpose. For that reason, I have explicitly set the output to the default audio device in this function, because we will change the output type to a file later. Keep reading.
Stopping the speech
In the preceding code, you saw that each time the speech is initiated, a handle is captured to the current Prompt object that is later used to stop the speech process. The following event handler does that:
- private void stop_Click(object sender, RoutedEventArgs e)
- {
-
- if (reading)
- {
- reader.SpeakAsyncCancel(activePrompt);
- }
- }
The prompt is passed to SpeakAsyncCancel and is cancelled. If you want to allow multiple prompts to be spoken, you can also call SpeakAsyncCancelAll(), which cancels every queued prompt.
Generating a Waveform file
Another use of this library is to generate audio files from your text-to-speech output. The file can be shared over the network, streamed to your users or stored in the file system for later use, and there are perhaps many other uses for it.
I will show how to create the file; you can then use System.Diagnostics.Process.Start("file-path.wav"); to listen to it programmatically, or open it using Windows Explorer.
The following code snippet will work:
- private void save_Click(object sender, RoutedEventArgs e)
- {
- string message = text.Text;
-     // Only write the file when nothing is currently being read aloud
- if (!reading)
- {
- reader.SetOutputToWaveFile("E:\\MyAudioFile.wav");
- reader.SpeakAsync(message);
- }
- else
- {
- MessageBox.Show("Previous reader is currently reading. Press 'Stop' to try stopping it and try again.");
- }
- }
It changes the output to a file and writes the audio to that file. (Remember: the file is held by the program and is not accessible to other programs for as long as the program is referencing the file for output.) You can listen to the file later, whenever you want to.
You can also get the output as a stream. Please read the SpeechSynthesizer documentation on MSDN for more details.
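As a rough sketch of both ideas, the following helpers render a message to a file and to an in-memory stream, and then call SetOutputToNull so that the synthesizer lets go of the file or stream. This is not the application's code: the WaveExport class and its two methods are hypothetical names, and the example assumes that a separate, short-lived SpeechSynthesizer on a worker thread is acceptable for exporting.

```csharp
using System.IO;
using System.Speech.Synthesis;
using System.Threading.Tasks;

public static class WaveExport
{
    // Hypothetical helper: writes the spoken message to a .wav file on a worker
    // thread, so the synchronous Speak call does not block the UI thread.
    public static Task SaveToWaveFileAsync(string message, string path)
    {
        return Task.Run(() =>
        {
            using (var reader = new SpeechSynthesizer())
            {
                reader.SetOutputToWaveFile(path);
                reader.Speak(message);
                // Detach from the file so that other programs can open it.
                reader.SetOutputToNull();
            }
        });
    }

    // Hypothetical helper: renders the spoken message into memory as wave data.
    public static byte[] RenderToWaveBytes(string message)
    {
        using (var reader = new SpeechSynthesizer())
        using (var stream = new MemoryStream())
        {
            reader.SetOutputToWaveStream(stream);
            reader.Speak(message);
            // Detach from the stream before copying its contents.
            reader.SetOutputToNull();
            return stream.ToArray();
        }
    }
}
```

Using a dedicated synthesizer here keeps the export independent of the reader object that speaks through the default audio device, at the cost of creating one more object per export.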
Points of interest
While writing this article, I learned many things about speech, voices and other deeper text-to-speech concepts. Download the project sample, use the application, share it with friends and don't forget the next steps:
- Try loading PDF files into it.
- Try using Prompt objects to create a dialogue reading application.
Good luck everybody and happy coding. I hope I have helped you with this article.