
Exploring Microsoft Speech APIs

This article introduces the Speech APIs, one of the services updated in Windows 10 and powered by Project Oxford. It is divided into four parts:

I. Speech APIs features
II. Project Oxford Speech APIs
III. Windows 10 Speech APIs
IV. Demo


I. Speech APIs features

These APIs provide two major features:

  • Speech Recognition: Speech To Text (STT)

It converts spoken audio to text. The APIs can recognize audio coming from the microphone in real-time, or from an audio file.

  • Speech Synthesis: Text To Speech (TTS)

It converts text to spoken audio when applications need to talk back to their users.

Microsoft Speech Platform

 

The Speech APIs are included in the Windows 10 libraries, and are also provided by the Project Oxford services, which require an Azure subscription.

II. Project Oxford Speech APIs 

These APIs are in beta. They work by sending data to Microsoft servers in the cloud, so an Azure account is required to use them. They also offer an Intent Recognition feature, which converts spoken audio into an intent.

To use these APIs, follow the steps below:

  1. Using an Azure account, go to the Marketplace to purchase the Speech APIs service (which is free 😉), then retrieve the primary or secondary key. In case you are using an Azure DreamSpark subscription, don’t be surprised if you don’t find this service: unfortunately, this type of account does not give access to the Oxford services.
  2. Download the Project Oxford Speech SDK from here. If you are targeting a platform other than Windows, have a look here: you will find what you are looking for.

 

  • Speech To Text (STT):

The Oxford version of the Speech APIs offers two ways to do STT:

  1. REST API
  2. Client library

When using the REST API, we only get one recognition result back at the end of the session; with the client library, we also get partial results before the final recognition.

Setting up speech recognition begins with the Speech Recognition Service Factory. By using this factory, we can create an object which can make a recognition request to the Speech Recognition Service. This factory can create two types of objects:

  1. A Data Recognition Client: used for speech recognition with data (for example from an audio file). The data is broken up into buffers, and each buffer is sent to the Speech Recognition Service.
  2. A Microphone Recognition Client: used for speech recognition from the microphone. The microphone is turned on, and data is sent to the Speech Recognition Service.

When creating a client from the factory, it can be configured in one of two modes:

  1. In ShortPhrase mode, an utterance may only be up to 15 seconds long. As data is sent to the server, the client receives multiple partial results and one final result containing multiple N-best choices.
  2. In LongDictation mode, an utterance may be up to 2 minutes long. As data is sent to the server, the client receives multiple partial results and multiple final results.

The client can also be configured for one of several languages:

  • American English: "en-us"
  • British English: "en-gb"
  • German: "de-de"
  • Spanish: "es-es"
  • French: "fr-fr"
  • Italian: "it-it"
  • Mandarin: "zh-cn"

Now, time to code 😀. You can implement the code below in a WPF app.


string stt_primaryOrSecondaryKey = ConfigurationManager.AppSettings["primaryKey"];

// We have 2 choices: LongDictation or ShortPhrase
SpeechRecognitionMode stt_recoMode = SpeechRecognitionMode.LongDictation;

// For speech recognition from the microphone
MicrophoneRecognitionClient stt_micClient =
    SpeechRecognitionServiceFactory.CreateMicrophoneClient(stt_recoMode, "fr-fr", stt_primaryOrSecondaryKey);

// For speech recognition from data, e.g. a wav file
DataRecognitionClient stt_dataClient =
    SpeechRecognitionServiceFactory.CreateDataClient(stt_recoMode, "fr-fr", stt_primaryOrSecondaryKey);
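The subscription key above is read from the application's configuration file; a matching App.config entry might look like this (the value is a placeholder):

<configuration>
  <appSettings>
    <add key="primaryKey" value="YOUR_SUBSCRIPTION_KEY" />
  </appSettings>
</configuration>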

Then we must subscribe to some events to get the recognition results. The Microphone Recognition Client and the Data Recognition Client expose the same events, except that the microphone status event only exists on the microphone client:

  • OnConversationError: event fired when a conversation error occurs.
  • OnIntent: event fired when speech recognition has finished, the recognized text has been parsed with LUIS for intent and entities, and the structured JSON result is available.
  • OnMicrophoneStatus: event fired when the microphone recording status has changed.
  • OnPartialResponseReceived: event fired when a partial response is received.
  • OnResponseReceived: event fired when a response is received.

Inside the event handlers, we can do whatever we want: display the result in a TextBox, for example, and more.


// Event handlers for speech recognition results
stt_micClient.OnResponseReceived += OnResponseReceivedHandler;
stt_micClient.OnPartialResponseReceived += OnPartialResponseReceivedHandler;
stt_micClient.OnConversationError += OnConversationErrorHandler;
stt_micClient.OnMicrophoneStatus += OnMicrophoneStatus;

// Data client events, for audio coming from a file for example
stt_dataClient.OnResponseReceived += OnResponseReceivedHandler;
stt_dataClient.OnPartialResponseReceived += OnPartialResponseReceivedHandler;
stt_dataClient.OnConversationError += OnConversationErrorHandler;
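The handlers themselves are ordinary methods. Here is a minimal sketch of what they might look like, based on the event argument types of the Project Oxford client library (TXB_Result is a hypothetical TextBox of the WPF window):

void OnPartialResponseReceivedHandler(object sender, PartialSpeechResponseEventArgs e)
{
    // Partial results arrive while the user is still speaking
    Dispatcher.Invoke(() => TXB_Result.Text = e.PartialResult);
}

void OnResponseReceivedHandler(object sender, SpeechResponseEventArgs e)
{
    // The final response contains the N-best list of recognized phrases
    if (e.PhraseResponse.Results.Length > 0)
        Dispatcher.Invoke(() => TXB_Result.Text = e.PhraseResponse.Results[0].DisplayText);
}

void OnConversationErrorHandler(object sender, SpeechErrorEventArgs e)
{
    Dispatcher.Invoke(() => MessageBox.Show(e.SpeechErrorText, e.SpeechErrorCode.ToString()));
}

void OnMicrophoneStatus(object sender, MicrophoneEventArgs e)
{
    // e.Recording tells whether the microphone is currently capturing audio
    Console.WriteLine("Microphone recording: {0}", e.Recording);
}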

Now, how do we start or stop the speech recognition? It’s simple: we just need to make a method call.


// Turn on the microphone and stream audio to the Speech Recognition Service
stt_micClient.StartMicAndRecognition();

// Turn off the microphone and stop the recognition
stt_micClient.EndMicAndRecognition();

To convert an audio file to text, we just need to read the file into a buffer and send it to the server for recognition, as shown below:


if (!string.IsNullOrWhiteSpace(filename))
{
    using (FileStream fileStream = new FileStream(filename, FileMode.Open, FileAccess.Read))
    {
        int bytesRead = 0;
        byte[] buffer = new byte[1024];

        try
        {
            do
            {
                bytesRead = fileStream.Read(buffer, 0, buffer.Length);
                // Send the audio data to the cloud service
                stt_dataClient.SendAudio(buffer, bytesRead);
            } while (bytesRead > 0);
        }
        finally
        {
            // Signal the end of the audio stream
            stt_dataClient.EndAudio();
        }
    }
}

  • Text To Speech (TTS)

The TTS feature of Project Oxford can only be used through the REST API, and we have a complete example here.

The end-point to access the service is: https://speech.platform.bing.com/synthesize

The API uses HTTP POST to send the audio back to the client. The audio returned for a given request will not exceed 15 seconds.
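Under the hood, a synthesis request is a plain HTTP POST carrying an SSML body that names the voice and the text to read. The sketch below shows what such a call might look like in C#; the bearer-token authentication, the X-Microsoft-OutputFormat header, and the voice name are assumptions based on the service's documentation of the time, so treat the linked example as authoritative:

// Minimal sketch of a Project Oxford TTS request (assumed headers and voice name)
// Requires: using System.Net.Http; using System.Text; using System.Threading.Tasks;
async Task<byte[]> SynthesizeAsync(string text, string accessToken)
{
    string ssml = "<speak version='1.0' xml:lang='en-US'>" +
                  "<voice xml:lang='en-US' name='Microsoft Server Speech Text to Speech Voice (en-US, ZiraRUS)'>" +
                  text + "</voice></speak>";

    using (var client = new HttpClient())
    {
        client.DefaultRequestHeaders.Add("Authorization", "Bearer " + accessToken);
        client.DefaultRequestHeaders.Add("X-Microsoft-OutputFormat", "riff-16khz-16bit-mono-pcm");

        var response = await client.PostAsync("https://speech.platform.bing.com/synthesize",
            new StringContent(ssml, Encoding.UTF8, "application/ssml+xml"));
        response.EnsureSuccessStatusCode();

        // The response body is the synthesized audio (15 seconds maximum)
        return await response.Content.ReadAsByteArrayAsync();
    }
}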

For any questions about using this API, please refer to the TTS through REST API documentation.

III. Windows 10 Speech APIs

Windows 10 Speech APIs support all Windows 10 based devices including IoT hardware, phones, tablets, and PCs.

The Speech APIs in Windows 10 live under these two namespaces:

  • Windows.Media.SpeechRecognition
  • Windows.Media.SpeechSynthesis

Requirements

  1. Windows 10
  2. Visual Studio 2015
  3. Make sure that Windows Universal App Development Tools are installed in VS2015.

First of all, we have to create a Windows 10 Universal application project in Visual Studio: in the New Project dialog box, click Visual C# > Windows > Windows Universal > Blank App (Windows Universal).

With Windows 10, applications don’t have permission to use the microphone by default, so you must first change the settings of the universal application as follows:

Double-click the file Package.appxmanifest > Capabilities > Microphone > select the check box.
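Checking the box simply adds the microphone device capability to the manifest XML:

<Capabilities>
  <DeviceCapability Name="microphone" />
</Capabilities>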

Note: the Windows 10 Speech APIs use the languages installed in the operating system.

  • Speech To Text (STT)

The STT feature of the Windows 10 APIs works online by default; if we want it to be available offline, we have to provide the necessary grammar manually (a sketch of this appears after the code below).

Making this feature work takes three steps:

  • Create a SpeechRecognizer object,
  • Create another object of type SpeechRecognitionConstraint and add it to the SpeechRecognizer object already created,
  • Compile the constraints.

SpeechRecognizer supports 2 types of recognition sessions:

  1. Continuous recognition sessions, for prolonged audio input. A continuous session must either be ended explicitly or it times out automatically after a configurable period of silence (20 seconds by default).
  2. Speech recognition sessions, for recognizing a short phrase. The session ends and the recognition results are returned when the recognizer detects a pause.

As shown in the code below:


SpeechRecognizer speechRecognizer = new SpeechRecognizer();
// Here we choose a simple constraints scenario of dictation
var dictationConstraint = new SpeechRecognitionTopicConstraint(SpeechRecognitionScenario.Dictation, "dictation");
speechRecognizer.Constraints.Add(dictationConstraint);
SpeechRecognitionCompilationResult result = await speechRecognizer.CompileConstraintsAsync();
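For the offline scenario mentioned earlier, the grammar can be supplied manually, for example with a list constraint instead of the web-based dictation topic. A minimal sketch, assuming we only need a handful of commands:

// A list constraint embeds the grammar in the app, so it also works offline.
// Note: this replaces the dictation topic constraint rather than combining with it.
speechRecognizer.Constraints.Clear();
var commands = new[] { "yes", "no", "start", "stop" };
speechRecognizer.Constraints.Add(new SpeechRecognitionListConstraint(commands, "commands"));
SpeechRecognitionCompilationResult listResult = await speechRecognizer.CompileConstraintsAsync();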

A continuous recognition session can be started by calling the SpeechRecognizer.ContinuousRecognitionSession.StartAsync() method and stopped by calling SpeechRecognizer.ContinuousRecognitionSession.StopAsync(). The SpeechRecognizer.ContinuousRecognitionSession object provides two events:

  • Completed: occurs when a continuous recognition session ends.
  • ResultGenerated: occurs when the speech recognizer returns a result from a continuous recognition session.

There is another event tied to the speechRecognizer object itself, HypothesisGenerated, which occurs when a recognition result fragment is returned by the speech recognizer.
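Subscribing to these events might look like this (the handler names are placeholders; a full ResultGenerated handler is shown later in this article):

// Continuous session events
speechRecognizer.ContinuousRecognitionSession.ResultGenerated += ContinuousRecognitionSession_ResultGenerated;
speechRecognizer.ContinuousRecognitionSession.Completed += ContinuousRecognitionSession_Completed;

// Fired with intermediate result fragments while the user is still speaking
speechRecognizer.HypothesisGenerated += SpeechRecognizer_HypothesisGenerated;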

The code below shows how to start the recognition:


public async void StartRecognition()
{
    // The recognizer can only start listening in a continuous fashion if it is currently idle.
    // This prevents an exception from occurring.
    if (speechRecognizer.State == SpeechRecognizerState.Idle)
    {
        try
        {
            await speechRecognizer.ContinuousRecognitionSession.StartAsync();
        }
        catch (Exception ex)
        {
            var messageDialog = new Windows.UI.Popups.MessageDialog(ex.Message, "Exception");
            await messageDialog.ShowAsync();
        }
    }
}

To stop the recognition:


public async void StopRecognition()
{
    if (speechRecognizer.State != SpeechRecognizerState.Idle)
    {
        try
        {
            await speechRecognizer.ContinuousRecognitionSession.StopAsync();

            TXB_SpeechToText.Text = dictatedTextBuilder.ToString();
        }
        catch (Exception exception)
        {
            var messageDialog = new Windows.UI.Popups.MessageDialog(exception.Message, "Exception");
            await messageDialog.ShowAsync();
        }
    }
}

 

  • Text To Speech (TTS)

This feature is available in both offline and online mode. To make it work, we have to create a SpeechSynthesizer object, set the speech synthesizer engine (voice), then generate a stream from the speechSynthesizer.SynthesizeTextToStreamAsync method, passing the text we want read as a parameter.

To play the stream, we have to use a MediaElement object, as shown in the code below:


SpeechSynthesizer speechSynthesizer = new SpeechSynthesizer();
speechSynthesizer.Voice = SpeechSynthesizer.DefaultVoice;
// Init the media element which will play the text-to-speech stream
MediaElement mediaElement = new MediaElement();
// We have to add the mediaElement to the Grid, otherwise it won't work
LayoutRoot.Children.Add(mediaElement);

var stream = await speechSynthesizer.SynthesizeTextToStreamAsync("Hello World!");
mediaElement.SetSource(stream, stream.ContentType);
mediaElement.Play();

Managing voice commands using the STT and TTS features

We can make applications that implement these APIs more interactive by passing commands using voice. Once a command is executed, the app confirms it using the TTS feature. To do that, we can use the STT events, as shown in the code below:


private async void ContinuousRecognitionSession_ResultGenerated(SpeechContinuousRecognitionSession sender, SpeechContinuousRecognitionResultGeneratedEventArgs args)
{
    // We can filter the generated text using the confidence level of the conversion (Low, Medium, High)
    if (args.Result.Confidence == SpeechRecognitionConfidence.Medium || args.Result.Confidence == SpeechRecognitionConfidence.High)
    {
        // The key word that activates any command,
        // e.g. the user says "text red": the key word is "text" and the command is "red"
        string command = "text";

        if (args.Result.Text.ToLower().Contains(command))
        {
            string result = args.Result.Text.ToLower();
            string value = result.Substring(result.IndexOf(command) + command.Length + 1).ToLower();
            // The generated text may end with a period
            value = value.Replace(".", "");
            switch (value)
            {
                case "in line": case "line": case "in-line":
                    dictatedTextBuilder.AppendFormat("{0}", Environment.NewLine);
                    await dispatcher.RunAsync(CoreDispatcherPriority.Normal, () =>
                    {
                        ReadSpecialCommandToUser("Carriage return command is activated");
                    });
                    break;
                case "blue":
                    await dispatcher.RunAsync(CoreDispatcherPriority.Normal, () =>
                    {
                        TXB_SpeechToText.Foreground = new SolidColorBrush(Colors.Blue);
                        ReadSpecialCommandToUser("Blue color command is activated");
                    });
                    break;
                case "red":
                    await dispatcher.RunAsync(CoreDispatcherPriority.Normal, () =>
                    {
                        TXB_SpeechToText.Foreground = new SolidColorBrush(Colors.Red);
                        ReadSpecialCommandToUser("Red color command is activated");
                    });
                    break;
                case "green":
                    await dispatcher.RunAsync(CoreDispatcherPriority.Normal, () =>
                    {
                        TXB_SpeechToText.Foreground = new SolidColorBrush(Colors.Green);
                        ReadSpecialCommandToUser("Green color command is activated");
                    });
                    break;
                case "black":
                    await dispatcher.RunAsync(CoreDispatcherPriority.Normal, () =>
                    {
                        TXB_SpeechToText.Foreground = new SolidColorBrush(Colors.Black);
                        ReadSpecialCommandToUser("Black color command is activated");
                    });
                    break;
            }
        }
        else
        {
            dictatedTextBuilder.Append(args.Result.Text + " ");
        }

        await dispatcher.RunAsync(CoreDispatcherPriority.Normal, () =>
        {
            TXB_SpeechToText.Text = dictatedTextBuilder.ToString();
        });
    }
}

private async void ReadSpecialCommandToUser(string text)
{
    if (!string.IsNullOrWhiteSpace(text))
    {
        using (SpeechSynthesizer speech = new SpeechSynthesizer())
        {
            speech.Voice = SpeechSynthesizer.AllVoices.FirstOrDefault(item => item.Language.Equals(language.LanguageTag));
            SpeechSynthesisStream stream = await speech.SynthesizeTextToStreamAsync(text);

            mediaElement.SetSource(stream, stream.ContentType);
            mediaElement.Play();
        }
    }
}

 


IV. Demo

The video below shows how to edit text by voice in an app using the Windows 10 Speech APIs:

Going further

 

Speech APIs – Universal Windows Platform:

Speech APIs – Project Oxford: https://github.com/Microsoft/ProjectOxford-ClientSDK/tree/master/Speech

Project Oxford: https://www.projectoxford.ai
