Guidelines for Developers
By Bryan Mackenzie (Intel)
March 19, 2014
Introduction
Natural human-computer interaction is becoming more prevalent. Leap Motion’s controller and Microsoft’s Kinect* are both solutions for creating natural human-computer interaction. There is, however, a barrier for entry when developing these applications that impacts both hardware and software. The hardware is costly to develop and the software, complex. As it stands, the supported cameras for all solutions are peripheral accompaniments. Moving forward, however, Intel-supported small form factor devices will have supported integrated depth cameras and a supporting API via the Intel® Perceptual Computing SDK.
This article focuses on simple integration of hand tracking and voice recognition into the Unity 3D environment using the SDK. More specifically, the article covers how to interact with a game interface using voice recognition and how to control a three-dimensional object on the screen using hand tracking. You should have a basic understanding of Unity 3D and C#. References and additional resources are listed in the Resources section at the end of this article.
Basics of Perceptual Computing
The Intel Perceptual Computing SDK provides a library of pattern detection and recognition algorithms. The SDK includes a class library that simplifies the programming required to use this functionality. The SDK makes use of the UtilMPipeline interface, which offers simple but limited functionality for hand and finger tracking, voice recognition, and facial analysis. The use of this interface allows developers to focus on content creation instead of having to develop these interfaces from scratch.
Modules
The UtilMPipeline interface provides access to two core modules, voice and gesture, which will be the focus here. To use a particular module, you must first specify the mode or context of the intended interaction. PXCUPipeline provides an enumeration of all available modes. The voice module supports several modes of operation: dictation, voice synthesis (text-to-speech), and command and control. This article focuses on command and control, in which a list of voice commands such as “Play” and “Pause” allows control of and interaction with an application. To make use of this mode, the developer must first establish a command list. This is a predefined list of words that are accepted as user input, once recognized. In the context of games, this could be used for menu navigation and gameplay. For example, a player might say “Freeze” to interact with an NPC or “Menu” to pause and bring up a game’s menu.
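As a minimal sketch of what such a command list might look like, the snippet below passes a string array to the SetVoiceCommands call covered in the Voice Recognition section later in this article. The command words ("freeze", "menu", "resume") are hypothetical examples, and pipe is the PXCUPipeline instance created in the code samples below.

    // Hypothetical command list for in-game voice control.
    // Assumes 'pipe' is a PXCUPipeline initialized in voice recognition mode
    // (see the Voice Recognition section below).
    string[] commands = new string[] { "freeze", "menu", "resume" };
    pipe.SetVoiceCommands(commands);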
The second module is the Gesture recognition module. The Gesture module offers four different processing methods: blob information, geometric node tracking, pose/gesture notification, and alert notifications. This article covers three of the four available processing types: geometric node tracking, alert notifications, and gesture notifications.
Geometric node tracking is used in conjunction with geometric node data, such as labels that identify which areas of the target to track. Once tracked and recognized, alert notifications provide event notification of the recognition occurrence. Gesture notifications work in a similar fashion; an application can be notified of a recognized gesture. The gesture module can be used to move and interact with the game. An example of this is the traditional “swipe” to advance a menu screen or tracking the hand or head to move an object or camera (see the “head-coupled perspective” demo in the SDK root folder under Framework).
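As a sketch of how a gesture notification might drive menu navigation, the snippet below checks for a left swipe each frame using the QueryGesture call described later in this article. It assumes the SDK's swipe gesture label is named LABEL_NAV_SWIPE_LEFT (check the SDK reference manual for the exact label names), and AdvanceMenu() stands in for your own menu code.

    // Inside Update(), after a frame has been acquired (see the sections below).
    PXCMGesture.Gesture gdata;
    if (pipe.QueryGesture(PXCMGesture.GeoNode.Label.LABEL_ANY, out gdata))
    {
        // LABEL_NAV_SWIPE_LEFT is an assumed label name.
        if (gdata.label == PXCMGesture.Gesture.Label.LABEL_NAV_SWIPE_LEFT)
            AdvanceMenu(); // hypothetical menu-navigation helper
    }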
Setting Up Unity 3D
Before diving into the Intel Perceptual Computing SDK, we must first understand the basics of Unity 3D and what we need to begin using the SDK. It’s important to understand the concept of GameObjects and the three primary Unity 3D functions before moving forward.
A GameObject is a container that holds various Components that make up the GameObject. These components consist of items such as lighting, rigid body physics, and so on. The three functions we will work with are Start, Update, and OnDisable.
The Start function is called once, after the GameObject has been loaded but before the first frame update; initialization that must happen earlier is done in the Awake function. The Update function is called once per frame, and the OnDisable function is called when a GameObject is disabled or destroyed and is typically used for cleanup.
These three functions are where the logic for the human-computer input will be placed. To interface with the libraries after installing the SDK, you must place the files located at “C:\Program Files (x86)\Intel\PCSDK\framework\Unity\hellounity\Assets\Plugins” into your project’s “Assets/Plugins” folder. Create the Plugins folder if needed.
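To make the structure concrete, here is a minimal skeleton showing where the SDK logic discussed in the rest of this article will live. The script name PerceptualInput is a hypothetical example; attach the script to a GameObject in your scene.

    using UnityEngine;

    public class PerceptualInput : MonoBehaviour // hypothetical script name
    {
        void Start()
        {
            // Create and initialize the PXCUPipeline here (shown in the next section).
        }

        void Update()
        {
            // Acquire a frame, query gesture or voice results, then release the frame.
        }

        void OnDisable()
        {
            // Close and dispose of the pipeline.
        }
    }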
Preparing to Program with the SDK
The first thing that needs to be done to utilize the SDK is to create an instance of PXCUPipeline. We then specify the mode of operation. Below we have created two private variables: one that will hold our instance of PXCUPipeline and another that specifies the mode of operation, which in this case is gesture recognition. Then inside Unity’s Start function, we call the new PXCUPipeline() constructor and initialize our pipe object. Other options available to the developer are to call Unity’s OnEnable function, which is called when the object is activated, or the Awake function, which is called once when all GameObjects have been initialized.
    private PXCUPipeline pipe;
    private PXCUPipeline.Mode mode = PXCUPipeline.Mode.GESTURE;

    void Start()
    {
        pipe = new PXCUPipeline();
        if (!pipe.Init(mode))
        {
            print("Unable to initialize gesture mode");
            return;
        }
    }
Next we create the Update function, where the logic for the interaction will be processed. The Unity Update function is called once per frame. It is important to note that blocking calls should never be placed in any Unity 3D update function, as this will yield undesirable results: Unity script code runs on a single main thread, and blocking that thread makes the application unresponsive.
Consider instead the use of an asynchronous, callback-based approach. In the Update function, the first thing we do is attempt to acquire a frame for processing using AcquireFrame. AcquireFrame takes a Boolean that specifies whether the call should wait for a frame before continuing (blocking) or return and try again next frame. The call returns true if a frame is acquired and results are available for processing; otherwise it returns false. In the Unity environment, we set the parameter to false.
As mentioned before, blocking should never occur on Unity’s main thread. Once valid data has been acquired, a lock is placed on the frame results. When processing of the frame results has completed, the lock must be released using the ReleaseFrame function so we may acquire new results for the next frame. Frames are sampled every 33 milliseconds (~30 frames per second). When image and audio processing are both enabled, frame acquisition alternates between audio and image processing, but both maintain this sample rate. If frame processing takes longer than the sample interval, the frame lock remains intact, and frames are dropped until the lock is released.
    void Update()
    {
        if (!pipe.AcquireFrame(false))
            return;

        // Query gesture or voice results here before releasing the frame.

        pipe.ReleaseFrame();
    }
Finally, when the GameObject is destroyed, we must close the pipeline and release its resources. This is done in the OnDisable function, which is called when the object in question is disabled or destroyed.
    void OnDisable()
    {
        pipe.Close();
        pipe.Dispose();
    }
Now that we understand the flow and how Unity interfaces with the SDK, we will move on to hand tracking.
Hand Tracking
When discussing hand tracking, it’s important to understand the GeoNode structure and the concept of labels. A label defines how body-related objects are identified in the scene; labels are broken into two major areas: the full-body label and the hand-detail label. The full-body label provides information pertaining to skeletal positioning within the view, while the hand-detail label provides detailed information about the hand object. The GeoNode structure describes the geometric node, which gives data relevant to the object being tracked, such as confidence level, position, and label information.
In the code listing below, assuming that the pipeline has been created and initialized as above, we create the necessary body and hand label variables that specify the object to track. In this case, we are looking for the middle of the first hand to be recognized. In the Update function, we attempt to acquire a frame for processing, making sure we have an early out if no frame is acquired. If a frame is acquired successfully, we call the QueryGeoNode function, which returns the details of the GeoNode structure. The call takes two parameters: (1) a reference to the GeoNode structure where the result data will be stored and (2) the body and hand labels OR'd together, which again specify that we are looking for the center of the primary hand. When the query for the hand center position returns true, we have valid data stored in our instance of the GeoNode struct. Using the GeoNode struct data, we can begin to customize the user interaction by getting the position of the user’s hand, then mapping it to actions or objects in our application. After we have processed the results, we must make sure to release the frame and resources.
    private PXCMGesture.GeoNode ndata;
    private PXCMGesture.GeoNode.Label bodyLabel = PXCMGesture.GeoNode.Label.LABEL_BODY_HAND_PRIMARY;
    private PXCMGesture.GeoNode.Label handLabel = PXCMGesture.GeoNode.Label.LABEL_HAND_MIDDLE;
    private Vector3 HandPosition; // stores the tracked hand position

    void Update()
    {
        if (!pipe.AcquireFrame(false))
            return;

        if (pipe.QueryGeoNode(bodyLabel | handLabel, out ndata))
        {
            // Get the standard hand position
            HandPosition.x = ndata.positionWorld.x;
            HandPosition.y = ndata.positionWorld.y;
            HandPosition.z = ndata.positionWorld.z;
        }

        pipe.ReleaseFrame();
    }
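As a sketch of how this position data might drive an on-screen object such as the reticule discussed below, the snippet maps the tracked hand position onto a target Transform. The targetObject field, the movementScale value, and the smoothing factor are hypothetical and would need tuning for your own scene.

    public Transform targetObject;      // hypothetical: the reticule or object to move
    public float movementScale = 10.0f; // hypothetical scale from camera space to scene units

    // Call from Update() after QueryGeoNode succeeds.
    void MoveTarget()
    {
        // Map the tracked hand position into the scene, smoothing to reduce jitter.
        Vector3 desired = HandPosition * movementScale;
        targetObject.position = Vector3.Lerp(targetObject.position, desired, Time.deltaTime * 5.0f);
    }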
Another example of querying for hand data is to determine the openness of the hand (again done in the Update function). In the code below, we query whether the primary hand is open or closed. Once QueryGeoNode has returned valid data for the primary hand, the LABEL_OPEN and LABEL_CLOSE openness values are used to determine the hand’s state. A practical use for this might be to select an option on a menu.
    // isOpen is assumed to be a bool field on this script.
    if (pipe.QueryGeoNode(PXCMGesture.GeoNode.Label.LABEL_BODY_HAND_PRIMARY, out ndata))
    {
        if (ndata.opennessState == PXCMGesture.GeoNode.Openness.LABEL_OPEN)
            isOpen = true;
        if (ndata.opennessState == PXCMGesture.GeoNode.Openness.LABEL_CLOSE)
            isOpen = false;
    }
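A hedged sketch of that menu-selection idea: treat a transition from open to closed as a “grab to click” on whatever item is currently highlighted. SelectHighlightedItem() is a hypothetical function from your own menu code.

    private bool wasOpen; // previous frame's openness state

    // Call from Update() after the openness query above.
    void HandleSelection()
    {
        // Treat an open-to-closed transition as a selection ("grab to click").
        if (wasOpen && !isOpen)
            SelectHighlightedItem(); // hypothetical menu helper
        wasOpen = isOpen;
    }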
Now that we understand how to track a user’s hand, let’s discuss usage. As seen in the demo, hand tracking is utilized to control or interact with an object of interest—in this case a reticule. However, usage is not limited to games. Hand tracking data can be used to replace conventional mouse input. This brings about a special consideration: when and how to engage tracking.
Typically, when a user is done with traditional input, they release the mouse and the application takes no further input. When designing for hand tracking, the camera is constantly tracking a user’s hand. This poses a problem when the user’s action is not intended for the application, such as reaching for a glass or a phone, so be mindful of how and when to engage and release hand tracking. A possible solution might be to have the user explicitly gesture for control, perhaps disengaging with a closed hand and engaging with a wave gesture. With that possibility raised, let’s move to Gesture Recognition.
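As a sketch of that idea, the snippet below keeps a simple engaged flag: a wave gesture engages tracking and a closed hand disengages it. It assumes the SDK’s wave gesture label is named LABEL_HAND_WAVE; check the SDK reference manual for the exact name.

    private bool engaged; // whether hand input is currently being acted on

    // Call with the gesture data returned by QueryGesture (see the next section).
    void UpdateEngagement(PXCMGesture.Gesture gdata)
    {
        // Engage on a wave gesture (label name assumed), disengage when the hand closes.
        if (gdata.label == PXCMGesture.Gesture.Label.LABEL_HAND_WAVE)
            engaged = true;
        if (!isOpen)
            engaged = false;
    }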
Gesture Recognition
The Gesture module makes use of the QueryGesture function in much the same way as the QueryGeoNode function mentioned above; the PXCUPipeline mode remains set to Gesture. The QueryGesture function takes two parameters: a reference to the PXCMGesture.Gesture struct and a label that stipulates the gesture to be recognized. The code below makes use of the LABEL_ANY label, which states that we are looking for any gesture. This is done so that if any gesture is recognized, we can move on and determine which gesture has been identified and how it should be handled. QueryGesture returns true when the Gesture struct holds valid gesture or pose data; otherwise it returns false. The example below uses the “Big 5” gesture to pause a game. The gesture is recognized when the user extends the arm, palm forward with fingers evenly spread.
    PXCMGesture.Gesture gdata;
    // gamePause is assumed to be a bool field on this script.
    if (pipe.QueryGesture(PXCMGesture.GeoNode.Label.LABEL_ANY, out gdata))
    {
        if (gdata.label == PXCMGesture.Gesture.Label.LABEL_POSE_BIG5)
            gamePause = !gamePause;
    }
When a particular gesture is recognized, we pause the game and stop acting on user input. Keeping the user experience in mind, we do not detach or stop tracking the hand; action is simply not taken on the incoming data. This allows for a faster “re-connection” once the user re-engages with the application. Another route is to utilize voice recognition to engage and disengage with user input. In this specific sample, voice recognition is primarily used to interact with the interface.
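One way to express this, as a sketch, is to keep querying the hand every frame but skip only the code that acts on the data while the game is paused. The fields and the MoveTarget helper come from the earlier hand-tracking sketches.

    // Call from Update() after AcquireFrame succeeds.
    void ProcessHandInput()
    {
        // Tracking continues every frame so re-engagement is fast...
        if (pipe.QueryGeoNode(bodyLabel | handLabel, out ndata))
        {
            // ...but the data is only acted on while the game is not paused.
            if (!gamePause)
                MoveTarget(); // see the earlier hand-tracking sketch
        }
    }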
Voice Recognition
When enabling command and control using voice, we must specify that the mode of operation is voice recognition. In the Unity Start function, once the PXCUPipeline has been initialized, we specify and set a command list. A command list is a developer-defined string array of recognized commands for the application.
    private PXCUPipeline pipe;
    private PXCUPipeline.Mode mode = PXCUPipeline.Mode.VOICE_RECOGNITION;

    void Start()
    {
        pipe = new PXCUPipeline();
        if (!pipe.Init(mode))
        {
            print("Unable to initialize voice mode");
            return;
        }
        pipe.SetVoiceCommands(new string[] { "fire", "pause", "play" });
    }
Similar to the gesture recognition GeoNode struct, the PXCMVoiceRecognition.Recognition struct returns the voice command and dictation details. The structure holds two members we are concerned with: (1) the label member, which refers to the most likely recognized command from the command list, and (2) the confidence member, which holds the confidence value of the recognized command. The QueryVoiceRecognized function takes one parameter, a reference to the PXCMVoiceRecognition.Recognition struct that stores the processed results. Once the data processing is complete, we make sure resources are released.
    void Update()
    {
        if (!pipe.AcquireFrame(false))
            return;

        PXCMVoiceRecognition.Recognition rdata;
        if (pipe.QueryVoiceRecognized(out rdata))
            print("label = " + rdata.label);

        pipe.ReleaseFrame();
    }

    void OnDisable()
    {
        pipe.Close();
        pipe.Dispose();
    }
Using the stored data, we are able to determine which command was given from the predefined command list. This might be useful for issuing voice commands to control the in-game menu or, as previously mentioned, to control when user actions are acted on.
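As a sketch, and under the assumption (consistent with the SDK samples) that the label member is the index of the recognized command within the command list set above, the recognized command could be dispatched like this. FireWeapon() is a hypothetical game function, and gamePause is the field used in the gesture example.

    // Must match the array passed to SetVoiceCommands above.
    private string[] commands = new string[] { "fire", "pause", "play" };

    void HandleVoiceCommand(PXCMVoiceRecognition.Recognition rdata)
    {
        // label is assumed to be the index into the command list.
        int idx = (int)rdata.label;
        if (idx < 0 || idx >= commands.Length)
            return; // e.g., dictation or unrecognized input

        switch (commands[idx])
        {
            case "fire":  FireWeapon();      break; // hypothetical game function
            case "pause": gamePause = true;  break;
            case "play":  gamePause = false; break;
        }
    }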
Conclusion
This paper describes how to use the Intel Perceptual Computing SDK within Unity games to create interaction using hand tracking and voice and gesture recognition. We also discussed special considerations for engaging and disengaging from users when their actions are not intended for the application. In the introduction, we stated that building applications that take advantage of natural human-computer interaction is a complex endeavor. This paper intended to show how using the UtilMPipeline interface from the SDK makes the task manageable, allowing developers to easily add intuitive interaction to their applications.
Intel® Perceptual Computing Technology and Intel® RealSense™ Technology
This paper was written to describe the use of the Intel® Perceptual Computing SDK to create new intuitive, natural user interfaces with the Unity 3D Environment. At CES 2014, Intel announced Intel® RealSense™ Technology, a new name and brand for Perceptual Computing. Look for the new Intel® RealSense™ SDK as well as Ultrabook™ devices shipping with Intel® RealSense™ 3D Cameras embedded in them.
Resources
Intel® Perceptual Computing SDK Getting Started Guide:
http://software.intel.com/sites/default/files/f3/4e/perc-gettingstarted-11-27.pdf
Intel® Perceptual Computing SDK Reference manual:
http://software.intel.com/sites/landingpage/perceptual_computing/documentation/html/
About the Author
Bryan Mackenzie is a software engineer in the Developer Relations Division at Intel. He helps deliver leading-edge user experiences with optimal performance and power for all types of consumer applications. In his spare time, Bryan focuses on his passion for game development, crafting experiences across a variety of platforms.