XWand: UI for Intelligent Environments
The XWand is a novel wireless sensor package that enables natural styles of interaction with intelligent environments. For example, a user may point the wand at a device and control it using simple gestures. The XWand system leverages the intelligence of the ubiquitous computing environment to best determine the user's intention.

Work on the XWand can be divided into two broad categories: the design of the hardware device, and the design of the user experience and software system that uses the hardware device. Prototypes of both halves of the system have been developed.
Hardware device
The XWand hardware device includes a custom printed circuit board (PCB) with a variety of off-the-shelf sensors, including a 3-axis magnetometer, a 2-axis MEMS accelerometer, and a 1-axis piezoelectric gyroscope. The output of these sensors is collected and formatted by an onboard PIC microcontroller and passed to a 418 MHz FM transceiver. A base station (not shown) receives data packets from the wand at about 50 Hz and passes the sensor readings to the host PC via RS-232. The wand also has two visible LEDs for feedback, a pushbutton for user input, and two IR LEDs for position tracking.

The magnetometer and accelerometer readings may be combined on the host side to obtain the true 3D orientation of the wand with respect to the room.
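One common way to combine a gravity reading with a magnetic field reading is sketched below. It is an illustration only, not the XWand implementation. The function names, the north-east-down room frame, and the body-axis convention (x along the wand, z down when level) are assumptions. The sketch also assumes the readings are already calibrated, that the accelerometer values give the components of the downward gravity direction in units of g, and that the wand is held still and not upside down, so the missing third axis can be reconstructed.

```python
import math

def _cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _unit(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    if n == 0.0:
        raise ValueError("zero-length vector")
    return tuple(c / n for c in v)

def gravity_from_two_axes(gx, gy):
    """Complete a 2-axis gravity reading (units of g) to a 3D 'down' vector.

    Assumption: the wand is at rest and not upside down, so z >= 0.
    """
    gz_sq = max(0.0, 1.0 - gx * gx - gy * gy)
    return (gx, gy, math.sqrt(gz_sq))

def orientation_matrix(down_body, mag_body):
    """Rows are the room's north, east, and down axes in wand coordinates.

    Multiplying this matrix by a wand-frame vector gives room coordinates.
    Fails if the magnetic field is parallel to gravity.
    """
    down = _unit(down_body)
    east = _unit(_cross(down, mag_body))   # horizontal, perpendicular to field
    north = _cross(east, down)             # completes the right-handed frame
    return (north, east, down)

def to_room(R, v):
    """Rotate a wand-frame vector into the room frame."""
    return tuple(sum(row[i] * v[i] for i in range(3)) for row in R)

# Level wand whose magnetometer sees "north" along its negative y axis:
R = orientation_matrix(gravity_from_two_axes(0.0, 0.0), (0.0, -0.4, 0.9))
pointing = to_room(R, (1.0, 0.0, 0.0))     # wand axis in room coordinates
```

In this example the wand axis comes out along the room's east axis, as expected for a level wand turned 90 degrees clockwise from north.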
The IR LEDs support 3D tracking via external cameras. The PIC is programmed to flash the IR LEDs at a predefined rate, such that simple image processing software on the host PC can recover the 2D position of the wand in each camera view. This 2D information from multiple cameras is combined to find the 3D position of the IR LEDs.
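The step of combining 2D detections from two cameras can be illustrated with midpoint triangulation. This is a generic technique, not necessarily the one used by the XWand software. The sketch assumes each camera is calibrated, so that a 2D detection has already been converted into a ray (camera position plus viewing direction) in room coordinates. The function name is hypothetical.

```python
def triangulate_midpoint(p1, d1, p2, d2):
    """Return the point halfway between the closest points of two rays.

    p1, p2: camera positions; d1, d2: ray directions (need not be unit length).
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    w = tuple(a - b for a, b in zip(p1, p2))
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are parallel; position is not determined")
    s = (b * e - c * d) / denom    # parameter along ray 1
    t = (a * e - b * d) / denom    # parameter along ray 2
    q1 = tuple(p + s * v for p, v in zip(p1, d1))
    q2 = tuple(p + t * v for p, v in zip(p2, d2))
    return tuple((u + v) / 2.0 for u, v in zip(q1, q2))

# Two cameras four units apart, both seeing the LED at (2, 3, 1):
led = triangulate_midpoint((0, 0, 0), (2, 3, 1), (4, 0, 0), (-2, 3, 1))
```

With noisy detections the two rays will not intersect exactly, which is why the midpoint of their closest approach is used rather than a true intersection.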
The RF transceiver on the wand can send as well as receive data. Presently the wand uses a call/response protocol, in which the host PC sends a request for data and the wand sends a data packet back. This bidirectional link lets the host send commands to the wand (for example, to turn the on-board LEDs on and off) and allows multiple wands to share the same frequency.
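The call/response idea can be simulated in a few lines. The packet fields, class names, and the use of a wand ID for addressing below are all assumptions made for illustration; the actual XWand packet format is not described here. The point of the sketch is that every wand hears each request on the shared frequency, but only the addressed wand answers, so replies never collide.

```python
from dataclasses import dataclass

@dataclass
class Request:
    wand_id: int            # hypothetical address field
    leds_on: bool = False   # hypothetical command carried with the request

@dataclass
class Reply:
    wand_id: int
    readings: tuple         # placeholder for the sensor values

class Wand:
    def __init__(self, wand_id, readings):
        self.wand_id = wand_id
        self.readings = readings
        self.leds_on = False

    def on_radio(self, request):
        """Called for every request heard on the shared frequency."""
        if request.wand_id != self.wand_id:
            return None                      # not addressed: stay silent
        self.leds_on = request.leds_on       # apply the host's command
        return Reply(self.wand_id, self.readings)

def poll(wands, request):
    """Broadcast one request and return the single reply, if any."""
    replies = [r for r in (w.on_radio(request) for w in wands) if r is not None]
    return replies[0] if replies else None

wands = [Wand(1, (0.1, 0.2)), Wand(2, (0.3, 0.4))]
reply = poll(wands, Request(wand_id=2, leds_on=True))
```

Polling each wand in turn from the host then yields an interleaved stream of readings without any arbitration logic on the wands themselves.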
Software and user experience
Software on the host side computes the orientation and position of the wand from the raw sensor readings. Together with the 3D positions of various targets in the room (for example, a controllable light), the system can determine whether the wand is pointed at a known object. The 3D location of an object can be trained by pointing at it with the wand a number of times from different positions in the room. This is presently done in a special training mode.
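A minimal sketch of the pointing test follows, assuming targets are stored as single 3D points and that a fixed angular threshold decides what counts as "pointed at". The threshold value, the function name, and the target coordinates are illustrative assumptions, and the real system may model targets differently.

```python
import math

def pointed_target(wand_pos, wand_dir, targets, max_angle_deg=10.0):
    """Return the name of the target closest to the pointing ray, or None.

    targets maps a name to a 3D position. A target qualifies only if the
    angle between the wand direction and the wand-to-target vector is
    within max_angle_deg.
    """
    dir_len = math.sqrt(sum(c * c for c in wand_dir))
    best_name, best_angle = None, max_angle_deg
    for name, pos in targets.items():
        to_target = [t - w for t, w in zip(pos, wand_pos)]
        dist = math.sqrt(sum(c * c for c in to_target))
        if dist == 0.0 or dir_len == 0.0:
            continue
        cos_a = sum(a * b for a, b in zip(wand_dir, to_target)) / (dist * dir_len)
        angle = math.degrees(math.acos(max(-1.0, min(1.0, cos_a))))
        if angle <= best_angle:
            best_name, best_angle = name, angle
    return best_name

targets = {"light": (0.0, 0.0, 3.0), "display": (3.0, 0.0, 0.0)}
hit = pointed_target((0.0, 0.0, 0.0), (0.05, 0.0, 1.0), targets)
```

Here the wand direction is about three degrees away from the light, so `hit` is `"light"`; pointing well away from both targets returns `None`.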

Figure: training a target by pointing at an object with the wand
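One plausible way to turn several pointing samples into a target location is a least-squares intersection of the pointing rays: find the point that minimizes the summed squared distance to all rays. The sketch below is an assumption about how such training could work, not a description of the XWand training mode, and the function names are hypothetical.

```python
import math

def _det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def estimate_target(rays):
    """Least-squares point nearest to all rays.

    rays: list of (origin, direction) pairs in room coordinates.
    Solves sum(I - d d^T) x = sum((I - d d^T) p) by Cramer's rule.
    """
    A = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for origin, direction in rays:
        n = math.sqrt(sum(c * c for c in direction))
        d = [c / n for c in direction]
        for i in range(3):
            for j in range(3):
                m_ij = (1.0 if i == j else 0.0) - d[i] * d[j]
                A[i][j] += m_ij
                b[i] += m_ij * origin[j]
    det = _det3(A)
    if abs(det) < 1e-9:
        raise ValueError("rays are (nearly) parallel; point from more places")
    solution = []
    for k in range(3):
        # Replace column k of A with b, as Cramer's rule requires.
        A_k = [[b[i] if j == k else A[i][j] for j in range(3)] for i in range(3)]
        solution.append(_det3(A_k) / det)
    return tuple(solution)

# Three pointing samples aimed at an object located at (2, 3, 1):
samples = [((0, 0, 0), (2, 3, 1)), ((4, 0, 0), (-2, 3, 1)), ((0, 5, 2), (2, -2, -1))]
location = estimate_target(samples)
```

The error raised for parallel rays reflects why the samples need to come from different positions in the room: rays from a single spot do not constrain the distance to the object.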
At runtime the system must examine what the user is doing with the wand and decide how to respond. The system tracks over time what the user is referring to; this is the "referent". The referent is currently determined either by what the wand is pointing at or by automatic speech recognition (SAPI). For example, you can point at the light, or say the word "light". The system also represents the "command" that goes with the referent. The command may be determined through gesture recognition on the wand sensors, speech recognition, button clicks, and so on. For the light, a possible command might be "turn on". The system then combines the referent and the command into an "action", which is simply a symbolic representation of a device control command. Continuing with the light example, this might be an X10 lamp module command to turn on.
Presently the combination of sensor readings, referent, command, and so on is performed by a Dynamic Bayes Net, a computational architecture that allows such calculations to be done probabilistically over time. One of the research goals of the XWand project is to identify an appropriate computational architecture for reasoning in multimodal systems with noisy sensors.

Figure: dynamic Bayes net for action selection
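To give a feel for probabilistic reasoning over time, the sketch below implements one time step of a discrete Bayes filter over the referent alone. This is a heavily simplified stand-in for the full network (it is the single-variable, hidden-Markov special case), and the state names, persistence probability, and likelihood values are invented for illustration.

```python
def forward_step(belief, stay_prob, likelihoods):
    """One predict-and-update step for the belief over the referent.

    belief: dict mapping each referent to its probability (sums to 1).
    stay_prob: probability the referent is unchanged since the last step;
               otherwise it moves uniformly to one of the other states.
    likelihoods: one dict per sensor, giving P(observation | referent).
    """
    states = list(belief)
    n = len(states)
    if n < 2:
        raise ValueError("need at least two referent states")
    # Predict: the referent tends to persist from one time step to the next.
    predicted = {
        s: stay_prob * belief[s] + (1.0 - stay_prob) * (1.0 - belief[s]) / (n - 1)
        for s in states
    }
    # Update: multiply in each sensor's evidence, treated as independent.
    posterior = dict(predicted)
    for lik in likelihoods:
        for s in states:
            posterior[s] *= lik[s]
    total = sum(posterior.values())
    if total == 0.0:
        raise ValueError("observations are impossible under every state")
    return {s: p / total for s, p in posterior.items()}

belief = {"light": 1 / 3, "display": 1 / 3, "none": 1 / 3}
pointing = {"light": 0.7, "display": 0.2, "none": 0.1}   # wand roughly at light
speech = {"light": 0.8, "display": 0.1, "none": 0.1}     # heard "light"
belief = forward_step(belief, stay_prob=0.9, likelihoods=[pointing, speech])
```

Two weak and noisy cues that agree produce a confident belief in "light", which is the kind of behavior that motivates combining modalities probabilistically rather than with hard rules.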
One consequence of representing the command and referent separately is that each may be specified in a variety of ways (pointing, speech, gesture, button click, etc.), resulting in a wide variety of ways to effect the same action. For example, the user may point at the light and click the button to turn the light on; say "light" and then say "turn on"; say "light" and then perform a turn-on gesture; or say "turn on" and then point at the light. All result in the same action.
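The effect of keeping referent and command separate can be sketched with a small deterministic state holder. It ignores uncertainty entirely, unlike the probabilistic approach described above, and the class name and event format are hypothetical.

```python
class Interpreter:
    """Collects a referent and a command, in either order, into an action."""

    def __init__(self):
        self.referent = None
        self.command = None

    def observe(self, kind, value):
        """Record one recognized event; return an action once both are known."""
        if kind == "referent":
            self.referent = value
        elif kind == "command":
            self.command = value
        else:
            raise ValueError("unknown event kind: " + kind)
        if self.referent is not None and self.command is not None:
            action = (self.referent, self.command)
            self.referent = self.command = None   # start fresh for next action
            return action
        return None

def run(events):
    """Feed a sequence of (kind, value) events; collect emitted actions."""
    interp = Interpreter()
    results = (interp.observe(kind, value) for kind, value in events)
    return [a for a in results if a is not None]

# Point (or say "light"), then click, gesture, or say "turn on":
first = run([("referent", "light"), ("command", "turn on")])
# Say "turn on", then point at the light:
second = run([("command", "turn on"), ("referent", "light")])
```

Both sequences yield the same single action, `("light", "turn on")`, regardless of which modality produced each event or the order in which they arrived.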
Applications
X10 Lamp control: Point at a light, click the button, and the light turns on or off. Or say "turn on" or "turn off". Gesture up or down to set the light level (dim or brighten).
MediaPlayer control: Point at the MediaPlayer window and click the button to start playback, or say "play". Click again to pause. Gesture up or down to move through the playlist. Roll left or right to change the volume.
WandMouse: Point at the display and click. The wand then controls the Windows cursor, with the wand button functioning as the left mouse button. Click the large "Disable" button to exit WandMouse mode.
ColorKinetics light control: Point at the light to activate it and invoke a sound. Roll left or right to change the color of the light. Fun for the whole family!
Research questions
What kinds of interactions do users expect? Which are they willing to learn?
Can the wand be used to contextualize automatic speech recognition?
How can multiple sources of noisy information be combined to arrive at a single interpretation?
Can a single set of gestures span the majority of applications?
Is the position of the wand really needed?
Documents and links
ADXL 202 accelerometer datasheet
HMC 1023 magnetometer datasheet
last edited on 04/26/2004