Speech Recognition in Kinect

Vuyiswamb
Posted in the Kinect category, Beginner level



 Download source code for Speech Recognition in Kinect


Introduction

 
I have noticed a few things while developing Kinect examples (I hesitate to call them apps, because I have not yet developed a full-fledged application). A Kinect user interface is different from a traditional user interface, where we see buttons that need to be clicked. Using voice, you can do the things you used to do with a mouse; using hand gestures, you can control your application the way you used to with a mouse.
In this article I will demonstrate how you can control your application using voice commands.
 

Objective

 
The objective of this article is to demonstrate how to control your application using voice commands instead of a mouse.

Author's Preface

When I started creating the example for this article, I first created buttons and later removed them. I removed the buttons because I no longer needed them: I could control my example app using voice commands, and the nice part is that I controlled the application using my mother tongue. This might be the last article I write this year. I will be taking a short break of two weeks, and after that I might give you a new one before the 3rd of January 2013.

Namespaces

 
You will need to add a reference that was not needed in my previous articles on the subject of Microsoft Kinect: the Microsoft.Speech assembly, which provides the Microsoft.Speech.Recognition namespace.

Figure 1.1
 
The usual location of this assembly is
C:\Windows\assembly\GAC_MSIL\Microsoft.Speech\11.0.0.0__31bf3856ad364e35\Microsoft.Speech.dll 
 
After you have added the required reference, we must set up our grammar file. Basically, we want to create an application that plays a video, but we don't want to play, pause, or stop the video using a mouse or by clicking a button. We want to speak commands and have the application respond accordingly.
 

Set Up the Grammar File

 
The grammar file is just an XML file built from the following elements:
 

Rule

 
A rule definition is represented by the rule element. The id attribute of the element indicates the name of the rule and must be unique within the grammar (this is enforced by XML). 
 

Item

 
An item element can surround any expansion to permit a repeat attribute or language identifier to be attached. The weight attribute of item is ignored unless the element appears within a one-of element.
 

Tag


A tag is a legal rule expansion (a tag can also be declared in the grammar header; see section 4.1 of the W3C specification listed under Reference).
A tag is an arbitrary string that may be included inline within any legal rule expansion. Any number of tags may be included inline within a rule expansion.
Tags do not affect the legal word patterns defined by the grammars or the process of recognizing speech or other input given a grammar.
Tags may contain content for semantic interpretation. The semantic interpretation processes may affect the recognition result. 
 

Grammar

 
The grammar element is the root element of the file; every rule element is nested inside it.
In this example it carries the version, the language (xml:lang), the SRGS namespace (xmlns), and the tag-format attribute.
The tag-format value "semantics/1.0-literals" means that the content of each tag element is treated as a literal string, which is the value our application receives when a phrase is recognized.
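
To see how the rule, one-of, item, and tag elements described above fit together, here is a minimal sketch that parses a cut-down grammar with LINQ to XML and prints each rule with its tag and phrases. The class name GrammarOutline, the method Describe, and the sample grammar string are my own illustrative names, not part of the Kinect or Speech SDK; the sketch only reads the XML and does not perform any speech recognition.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class GrammarOutline
{
    // A cut-down grammar in the same shape as the article's file (illustrative only).
    public const string SampleGrammar = @"
<grammar version=""1.0"" xml:lang=""en-US"" tag-format=""semantics/1.0-literals""
         xmlns=""http://www.w3.org/2001/06/grammar"">
  <rule id=""PLAYrule"" scope=""public"">
    <item>
      <tag>PLAY</tag>
      <one-of>
        <item>Play</item>
        <item>Start Video</item>
      </one-of>
    </item>
  </rule>
</grammar>";

    // Returns one line per rule in the form "ruleId -> TAG: phrase1 | phrase2".
    public static List<string> Describe(string grammarXml)
    {
        XNamespace ns = "http://www.w3.org/2001/06/grammar";
        XDocument doc = XDocument.Parse(grammarXml.Trim());

        return doc.Root.Elements(ns + "rule").Select(rule =>
        {
            string id = (string)rule.Attribute("id");

            // The tag holds the value the application receives when a phrase matches.
            string tag = rule.Descendants(ns + "tag")
                             .Select(t => t.Value.Trim())
                             .FirstOrDefault();

            // Items without child elements are the phrases the user can speak.
            IEnumerable<string> phrases = rule.Descendants(ns + "item")
                                              .Where(i => !i.HasElements)
                                              .Select(i => i.Value.Trim());

            return id + " -> " + tag + ": " + string.Join(" | ", phrases);
        }).ToList();
    }

    public static void Main()
    {
        foreach (string line in Describe(SampleGrammar))
        {
            Console.WriteLine(line);
        }
    }
}
```

Running Describe over your own grammar text is a quick way to check which spoken phrases map to which tag before you wire the file into the recognizer.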
 

Recap

 
Wikipedia has several articles on grammar file specifications. Basically, this file contains the possible words and phrases that can be used to control our application. When we are done, we add the file to the project as a resource and access it from our application, as depicted below.
 

Figure 1.2

Your full grammar file for this example should look like this:
 

Full Grammar File

 
<grammar version="1.0" xml:lang="en-US" tag-format="semantics/1.0-literals" xmlns="http://www.w3.org/2001/06/grammar">

  <rule id="PLAYrule" scope="public">
    <one-of>
      <item>
        <tag>PLAY</tag>
        <one-of>
          <item>Cala</item> <!-- In Zulu (mother tongue) it means Start -->
          <item>Start Video</item>
          <item>Dlala</item> <!-- In Zulu (mother tongue) it means Play -->
          <item>Play</item>
        </one-of>
      </item>
    </one-of>
  </rule>

  <rule id="Stoprule" scope="public">
    <one-of>
      <item>
        <tag>STOP</tag>
        <one-of>
          <item>IMA</item> <!-- In Zulu (mother tongue) it means Stop -->
          <item>Stop Video</item>
          <item>Stop</item>
          <item>Misa i Video</item> <!-- In Zulu (mother tongue) it means Stop the Video -->
        </one-of>
      </item>
    </one-of>
  </rule>

  <rule id="Pauserule" scope="public">
    <one-of>
      <item>
        <tag>PAUSE</tag>
        <one-of>
          <item>WAIT</item>
          <item>Pause</item>
        </one-of>
      </item>
    </one-of>
  </rule>

  <rule id="Volumeup" scope="public">
    <one-of>
      <item>
        <tag>UP</tag>
        <one-of>
          <item>PHEZULU</item> <!-- In Zulu (mother tongue) it means Up -->
          <item>KODIMU</item> <!-- In slang Sotho (mother tongue) it means Up -->
        </one-of>
      </item>
    </one-of>
  </rule>

  <rule id="Volumedown" scope="public">
    <one-of>
      <item>
        <tag>DOWN</tag>
        <one-of>
          <item>PHANSI</item> <!-- In Zulu (mother tongue) it means Down -->
          <item>KOTLASI</item> <!-- In slang Sotho (mother tongue) it means Down -->
        </one-of>
      </item>
    </one-of>
  </rule>
</grammar>
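
The rule ids in this file are the names the application asks for when it builds each grammar, and the comparison is case sensitive. As a hedged sketch (the class GrammarCheck and the method MissingRules are hypothetical helpers of mine, not SDK calls), a small check like the following could report any rule name that has no matching id in the grammar text before the recognizer is started:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class GrammarCheck
{
    // Returns the requested rule names that have no matching <rule id="..."> in the grammar.
    // The comparison is ordinal, so "Volumeup" and "volumeup" are different names.
    public static List<string> MissingRules(string grammarXml, IEnumerable<string> ruleNames)
    {
        XNamespace ns = "http://www.w3.org/2001/06/grammar";

        var ids = new HashSet<string>(
            XDocument.Parse(grammarXml.Trim())
                     .Root.Elements(ns + "rule")
                     .Select(r => (string)r.Attribute("id")),
            StringComparer.Ordinal);

        return ruleNames.Where(name => !ids.Contains(name)).ToList();
    }

    public static void Main()
    {
        // A tiny stand-in for the full grammar file (illustrative only).
        string grammar =
            @"<grammar version=""1.0"" xml:lang=""en-US"" xmlns=""http://www.w3.org/2001/06/grammar"">" +
            @"<rule id=""PLAYrule""><item>Play</item></rule>" +
            @"<rule id=""Volumeup""><item>PHEZULU</item></rule>" +
            @"</grammar>";

        List<string> missing = MissingRules(grammar, new[] { "PLAYrule", "Volumeup", "volumeup" });

        // Only the wrongly cased name is reported as missing.
        Console.WriteLine(string.Join(", ", missing));
    }
}
```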
  
As usual, when creating Kinect applications we create a WPF application and add a reference to our friend, the “Microsoft.Kinect” library. If you don’t know this, I suggest you read my previous articles on the subject of Microsoft Kinect. The XAML for the main window is:

<Window x:Class="VoiceCommandsInKinect.MainWindow" 
        xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation" 
        xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml" 
        Title="MainWindow" Height="572.244" Width="848.43"> 
    <Grid Margin="0,0,-5.6,4.4"> 
      
        <Label x:Name="Status" HorizontalAlignment="Left" VerticalAlignment="Top" Height="60" Width="100" Margin="0,0,748,478"  > 
        </Label> 
  
        <Image Source="..\image\Logo.png"   Height="60" Width="100" HorizontalAlignment="Center" VerticalAlignment="Top"  Margin="371,2,0,453"    /> 
        <MediaElement x:Name="VideoPlayer"  LoadedBehavior="Manual" UnloadedBehavior="Stop"    Margin="0,90,0,10"  ></MediaElement> 
    </Grid> 
</Window>    

This is just a simple media element named “VideoPlayer”, and I decorated my window with a Microsoft Kinect logo so that our example looks cool. I always try to comment my code line by line where it might not make sense to a reader who is new to the technology. The following code is commented to the best of my ability; if you have any questions, you can write a comment and I will explain and add a comment where needed.

using Microsoft.Kinect; 
using Microsoft.Speech.Recognition;
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text; 
using System.Windows;
using System.Windows.Controls; 
using System.Windows.Media; 

namespace VoiceCommandsInKinect
{
    /// <summary>
    /// Interaction logic for MainWindow.xaml
    /// </summary>
    public partial class MainWindow : Window
    {
        #region "Constants"

        // Names of the rules in the speech grammar file. Each value must match
        // a rule id in the grammar exactly; the match is case sensitive.

        // For the Play functionality
        private const string PLAYrule = "PLAYrule";
        // For the Stop functionality
        private const string Stoprule = "Stoprule";
        // For the Pause functionality
        private const string Pauserule = "Pauserule";
        // For the Volume down functionality
        private const string Volumedownrule = "Volumedown";
        // For the Volume up functionality
        private const string Volumeuprule = "Volumeup";

        /// <summary>
        /// Speech recognizer used to detect voice commands issued by application users.
        /// This type is not declared in this listing; it is assumed to be a helper
        /// class from the attached sample project.
        /// </summary>
        private SpeechRecognizer speechRecognizer;

        /// <summary>
        /// Speech grammars used by the application, one per rule.
        /// </summary>
        private Grammar PlayGrammar;
        private Grammar StopGrammar;
        private Grammar PauseGrammar;
        private Grammar VolumeupGrammar;
        private Grammar VolumedownGrammar;
        #endregion

        /// <summary>
        /// Initializes a new instance of the MainWindow class.
        /// </summary>
        public MainWindow()
        {
            InitializeComponent();
            // What should happen when the window is loaded
            Loaded += MainWindow_Loaded;
            // What should happen when the window is unloaded
            Unloaded += MainWindow_Unloaded;
            // What should happen when the window is closing
            Closing += MainWindow_Closing;
        }

        // Stop the sensor when the window is closing
        void MainWindow_Closing(object sender, System.ComponentModel.CancelEventArgs e)
        {
            if (sensor != null)
            {
                sensor.Stop();
            }
        }

        // Stop the sensor when the window is unloaded
        void MainWindow_Unloaded(object sender, RoutedEventArgs e)
        {
            if (sensor != null)
            {
                sensor.Stop();
            }
        }

        // Get the first sensor, or null when no Kinect is plugged in
        // (indexing KinectSensors[0] would throw on an empty collection)
        KinectSensor sensor = KinectSensor.KinectSensors.FirstOrDefault();

        // Show a colored status message so you can see whether the sensor is working
        void ShowStatus(string message, Color background, Color foreground)
        {
            Status.Content = message;
            Status.Background = new SolidColorBrush(background);
            Status.Foreground = new SolidColorBrush(foreground);
        }

        void MainWindow_Loaded(object sender, RoutedEventArgs e)
        {
            // Check whether a sensor is connected
            if (sensor == null || sensor.Status == KinectStatus.Disconnected)
            {
                ShowStatus("Kinect Sensor is not Connected", Colors.Orange, Colors.Black);
            }
            else if (sensor.Status == KinectStatus.Connected)
            {
                // Start the sensor
                sensor.Start();
                ShowStatus("Kinect Ready", Colors.Green, Colors.White);

                // Create one grammar per rule in the grammar file
                this.PlayGrammar = CreateGrammar(PLAYrule);
                this.StopGrammar = CreateGrammar(Stoprule);
                this.PauseGrammar = CreateGrammar(Pauserule);
                this.VolumedownGrammar = CreateGrammar(Volumedownrule);
                this.VolumeupGrammar = CreateGrammar(Volumeuprule);

                // Create the recognizer with all the grammars
                this.speechRecognizer = SpeechRecognizer.Create(new[] { PlayGrammar, StopGrammar, PauseGrammar, VolumeupGrammar, VolumedownGrammar });

                if (null != speechRecognizer)
                {
                    // Get notified when a phrase is recognized, then start listening
                    // on the Kinect microphone array
                    this.speechRecognizer.SpeechRecognized += SpeechRecognized;
                    this.speechRecognizer.Start(sensor.AudioSource);
                }
            }
            else if (sensor.Status == KinectStatus.NotPowered)
            {
                ShowStatus("Kinect Sensor is not Powered", Colors.Red, Colors.Black);
            }
            else if (sensor.Status == KinectStatus.NotReady)
            {
                ShowStatus("Kinect Sensor is not Ready", Colors.Red, Colors.Black);
            }
        }
 
 

        private void SpeechRecognized(object sender, SpeechRecognizerEventArgs e)
        {
            // Semantic values; these match the <tag> contents in the grammar file
            // Play the video
            const string PlayCommand = "PLAY";
            // Stop the video
            const string StopCommand = "STOP";
            // Pause
            const string PauseCommand = "PAUSE";
            // Volume down
            const string VolumedownCommand = "DOWN";
            // Volume up
            const string VolumeupCommand = "UP";

            // Nothing to do when the recognizer returned no semantic value
            if (null == e.SemanticValue)
            {
                return;
            }

            // Handle the video control commands
            switch (e.SemanticValue)
            {
                case PlayCommand:
                    PlayVideo();
                    break;

                case StopCommand:
                    VideoPlayer.Stop();
                    break;

                case PauseCommand:
                    VideoPlayer.Pause();
                    break;

                case VolumedownCommand:
                    // Sets the volume to the minimum, which mutes the player
                    VideoPlayer.Volume = 0;
                    break;

                case VolumeupCommand:
                    // Sets the volume to the maximum
                    VideoPlayer.Volume = 1;
                    break;
            }
        }
          
 
        /// <summary>
        /// Create a grammar from the grammar definition XML file.
        /// </summary>
        /// <param name="ruleName">
        /// Rule corresponding to the grammar we want to use.
        /// </param>
        /// <returns>
        /// New grammar object corresponding to the specified rule.
        /// </returns>
        private Grammar CreateGrammar(string ruleName)
        {
            Grammar grammar;

            // Access the grammar file, which is stored as a string resource.
            // UTF-8 is used instead of ASCII so that any non-ASCII characters
            // in the phrases are not replaced with '?'.
            using (var memoryStream = new MemoryStream(Encoding.UTF8.GetBytes(Properties.Resources.SpeechGrammar)))
            {
                grammar = new Grammar(memoryStream, ruleName);
            }

            return grammar;
        }

        // Function to play a video
        private void PlayVideo()
        {
            // Adjust this path to where the video is stored on your machine
            VideoPlayer.Source = new Uri(@"D:\Articles\How to use Voice Commands in Kinect\VoiceCommandsInKinect\VoiceCommandsInKinect\KinectSDK.wmv", UriKind.Absolute);
            VideoPlayer.LoadedBehavior = MediaState.Manual;
            VideoPlayer.Play();
        }
    }
}
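
The volume commands in the listing jump straight to the minimum or maximum volume. If you would rather move the volume one step per spoken command, the following is a minimal sketch of the arithmetic, assuming a step of 0.1 (an arbitrary value chosen for illustration). VolumeControl and Adjust are hypothetical names of mine; the only fact borrowed from WPF is that MediaElement.Volume ranges from 0 to 1.

```csharp
using System;

public static class VolumeControl
{
    // Step size per spoken command; 0.1 is an arbitrary choice for illustration.
    public const double Step = 0.1;

    // Returns the volume after moving by delta, kept inside the 0.0 to 1.0
    // range that MediaElement.Volume uses.
    public static double Adjust(double current, double delta)
    {
        return Math.Max(0.0, Math.Min(1.0, current + delta));
    }

    public static void Main()
    {
        double volume = 0.95;

        // One step up from 0.95 would overshoot, so the result is held at the maximum.
        volume = Adjust(volume, Step);
        Console.WriteLine(volume);

        // Stepping down from a small value is held at the minimum in the same way.
        volume = Adjust(0.05, -Step);
        Console.WriteLine(volume);
    }
}
```

In the speech handler, the UP case could then read VideoPlayer.Volume = VolumeControl.Adjust(VideoPlayer.Volume, VolumeControl.Step); and the DOWN case would pass -VolumeControl.Step instead.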

 

Demonstration

 
When you run your application, you will see the following:
 

Figure 1.3

And when I said “PLAY”, the video started playing.
 

Figure 1.4
 

Figure 1.5

Figure 1.6
 
I was able to say “Pause” to pause the video, or use my mother tongue and say “IMA”, and it stopped. I have attached an example project that will guide you.
 

Reference

 
http://www.w3.org/TR/speech-grammar/#S2
 
http://www.dotnetfunda.com/articles/article2050-introduction-to-microsoft-kinect.aspx
 

Conclusion

 
Thank you again for visiting Dotnetfunda to learn about this exciting technology. More articles will come soon.
 
Microsoft Kinect is the Future.
 
Vuyiswa Maseko

