
An Introduction to the Kinect Sensor

This chapter, from Start Here! Learn the Kinect API, explains how the Kinect sensor generates data about the world around it, what the key components of the sensor are and how they work, and how those components provide useful signals to a connected computer or console.

After completing this chapter, you will:

  • Understand how the Kinect sensor generates data about the world around it

  • Identify the key components of the Kinect sensor and how they work

  • Appreciate how the sensors and the Kinect provide useful signals to a connected computer or console

The Kinect Sensor

Until recently, computers had a very restricted view of the world around them, and users had very limited ways of communicating with computers. Over the years, computers have acquired cameras and audio inputs, but these have been used mostly for unrecognized input: computers can store and play back such content, but it has been very difficult to make them understand input in these forms.

For example, when people hear a sound, they can judge the distance and direction of its source relative to their own position. Computers, by contrast, struggle to make such judgments. Audio from several microphones does carry considerable information about the distance and direction of the source, but extracting that information is difficult for programs to do. Similarly, a video picture gives the computer an image of the environment to analyze, but the computer has to work very hard to extract information about the objects in the picture, because an image is a flat, two-dimensional representation of a three-dimensional world.

Kinect changes all this. The Kinect sensor bar contains two cameras, a special infrared light source, and four microphones. It also contains a stack of signal processing hardware that is able to make sense of all the data that the cameras, infrared light, and microphones can generate. By combining the output from these sensors, a program can track and recognize objects in front of it, determine the direction of sound signals, and isolate them from background noise.

Getting Inside a Kinect Sensor

To get an idea of how the Kinect sensor works, you could take one apart and look inside. (Don’t do that. There are many reasons why taking your Kinect apart is a bad idea: it’s hard to do, you will invalidate your warranty, and you might not be able to restore it to working condition. But perhaps the best reason not to take it apart is that I’ve already done it for you!)

Figure 1-1 shows a Kinect sensor when it is “fully dressed.”

Figure 1-1

Figure 1-1 A Kinect sensor.

Figure 1-2 shows a Kinect with the cover removed. You can see the two cameras in the middle and the special light source on the left. The four microphones are arranged along the bottom of the sensor bar. Together, these devices provide the “view” the Kinect has of the world in front of it.

Figure 1-2

Figure 1-2 A Kinect sensor unwrapped.

Figure 1-3 shows all the hardware inside the Kinect that makes sense of the information being supplied from all the various devices.

Figure 1-3

Figure 1-3 The Kinect sensor data processing hardware.

To make everything fit into the slim bar form, the designers had to stack the circuit boards on top of each other. Some of these components produce quite a bit of heat, so a tiny fan that can be seen on the far right of Figure 1-3 sucks air along the circuits to keep them cool. The base contains an electric motor and gear assembly that lets the Kinect adjust its angle of view vertically.

Now that you have seen inside the device, you can consider how each component helps the Kinect do what it does, starting with the “3D” camera.

The Depth Sensor

Kinect has the unique ability to “see” in 3D. Unlike most other computer vision systems, the Kinect system is able to build a “depth map” of the area in front of it. This map is produced entirely within the sensor bar and then transmitted down the USB cable to the host in the same way as a typical camera image would be transferred—except that rather than color information for each pixel in an image, the sensor transmits distance values.
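To make the distinction concrete, here is a rough sketch in plain Python (not the Kinect API, and with made-up values): a color frame stores a color per pixel, while a depth frame stores a single distance per pixel, yet the host reads both the same way.

```python
# Illustrative only: a color frame holds an RGB triple per pixel...
color_frame = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 0)],
]

# ...while a depth frame holds one distance value per pixel
# (here, hypothetical distances from the sensor in millimeters).
depth_frame = [
    [1200, 1180],
    [2450, 2440],
]

# The host addresses a "pixel" of either frame identically;
# only the meaning of the stored value differs.
print(color_frame[0][0])  # an RGB color
print(depth_frame[0][0])  # a distance in mm
```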

You might think that the depth sensor uses some kind of radar or ultrasonic sound transmitter to measure how far things are from the sensor bar, but actually it doesn’t. This would be difficult to do over a short distance. Instead, the sensor uses a clever technique consisting of an infrared projector and a camera that can see the tiny dots that the projector produces.

Figure 1-4 shows the arrangement of the infrared projector and sensor.

Figure 1-4

Figure 1-4 The Kinect infrared projector and camera.

The projector is the left-hand item in Figure 1-4. It looks somewhat like a camera, but in fact it is a tiny infrared projector. The infrared camera is on the right side of Figure 1-4. In between the projector and the camera are an LED that displays the Kinect device status and a camera that captures a standard 2D view of the scene. To explain how the Kinect sensor works, I’ll start by showing an ordinary scene in my house. Figure 1-5 shows my sofa as a person (okay, a camera) might see it in a room.

Figure 1-5

Figure 1-5 My sofa.

In contrast, Figure 1-6 shows how the Kinect infrared sensor sees the same view.

Figure 1-6

Figure 1-6 The sofa as the Kinect infrared sensor sees it.

The Kinect infrared sensor sees the sofa as a large number of tiny dots. The Kinect sensor constantly projects these dots over the area in its view. If you want to view the dots yourself, it’s actually very easy; all you need is a video camera or camcorder that has a night vision mode. A camera in night vision mode is sensitive to the infrared light spectrum that the Kinect distance sensor uses.

Figure 1-6, for example, was taken in complete darkness, with the sofa lit only by the Kinect. The infrared sensor in the Kinect is fitted with a filter that keeps out ordinary light, which is how it can see just the infrared dots, even in a brightly lit room. The dots are arranged in a pseudo-random pattern that is hardwired into the sensor. You can see some of the pattern in Figure 1-7.

Figure 1-7

Figure 1-7 The dot pattern on the sofa arm.

A pseudo-random sequence is one that appears to be random, but it is actually mechanically generated and easy to repeat. What’s important to remember here is that the Kinect sensor “knows” what the pattern looks like and how it is drawn. It can then compare the image from the camera with the pattern it knows it is displaying, and can use the difference between the two to calculate the distance of each point from the sensor.
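The key property of such a sequence can be shown with a few lines of Python. This is a simple linear congruential generator with illustrative constants, not the Kinect's actual pattern generator; the point is only that the same seed always reproduces the same "random" output, so the sensor can know exactly what pattern it is projecting.

```python
def pseudo_random(seed, count, modulus=2**16, a=25173, c=13849):
    """A simple linear congruential generator: the output looks
    random, but it is mechanically generated and easy to repeat."""
    values = []
    x = seed
    for _ in range(count):
        x = (a * x + c) % modulus  # each value determines the next
        values.append(x)
    return values

# Two runs with the same seed give identical sequences.
print(pseudo_random(42, 5) == pseudo_random(42, 5))  # True
```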

To understand how the Kinect does this, you can perform a simple experiment involving a darkened room, a piece of paper, a flashlight, and a helpful friend. You need to adjust the flashlight beam so it’s tightly focused and makes a small spot. Now, ask your friend to stand about 5 feet (1.5 meters) away from you, slightly to your right, holding the paper up so that it faces you. Holding the flashlight in your left hand, shine the spot onto the piece of paper. Now ask your friend to move forward toward you. As the person comes closer, you will see that the spot on the paper moves a little to the left, because the light now hits the paper before it has traveled quite as far to the right.

Figure 1-8 shows how this works. If you know where you are aiming the dot, you can work out how far away your friend is from the position of the dot on the paper. The impressive thing about the Kinect sensor is that it performs that calculation for thousands of dots, many times a second. The infrared camera in the Kinect allows it to “see” where each dot appears in the image. Because the software knows the pattern that the infrared transmitter is drawing, the hardware inside the Kinect does all the calculations that are required to produce the “depth image” of the scene that is sent to the computer or Xbox.
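The geometry behind this is similar triangles: the closer the surface, the larger the sideways shift (the "disparity") of a projected dot as seen by the camera. The sketch below uses this standard structured-light relationship with hypothetical numbers; the focal length and projector-to-camera baseline here are illustrative, not the Kinect's calibrated values.

```python
def depth_from_disparity(focal_length_px, baseline_m, disparity_px):
    """Similar-triangles depth estimate: distance is inversely
    proportional to the dot's sideways shift in the camera image."""
    return focal_length_px * baseline_m / disparity_px

# Hypothetical 580-pixel focal length, 7.5 cm baseline:
near = depth_from_disparity(580, 0.075, 29.0)  # big shift -> close
far = depth_from_disparity(580, 0.075, 14.5)   # small shift -> far
print(near, far)  # 1.5 m and 3.0 m
```

Halving the disparity doubles the estimated distance, which is why depth precision falls off for faraway objects.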

Figure 1-8

Figure 1-8 Showing how the Kinect distance sensor works.

This technique is interesting because it is completely different from the way that humans see distance. Each human eye gets a slightly different view of a scene, which means that the closer an object is to a human, the greater the difference between the images seen by each eye. The brain identifies the objects in the scene, determines how much difference there is between the image from each eye, and then assigns a distance value to each object.

In contrast, the Kinect sensor shines a tightly focused spot of light on points in the scene and then works out how far away that point is from the sensor by analyzing the spot’s reflection. The Kinect itself doesn’t identify any objects in a scene; that task is performed by software in an Xbox or computer, as you’ll see later.

The Kinect Microphones

The Kinect sensor also contains four microphones arranged along the bottom of the bar. You can see them in Figure 1-2: one at the left-hand end, and three more spaced along the right side of the unit. The Kinect uses these microphones to help determine where in a room a particular voice is coming from. This works because sound takes time to travel through air. Sound travels much more slowly than light, which is why you often hear a thunderclap long after seeing the corresponding bolt of lightning.

When you speak to the Kinect sensor, your voice will arrive at each microphone at different times, because each microphone is a slightly different distance away from the sound source. Software can then extract your voice waveform from the sound signal produced by each microphone and—using the timing information—calculate where the sound source is in the room. If several people are in a room with the Kinect, it can even work out which person is talking by calculating the direction from which their voice is coming, and can then “direct” the microphone array to listen to that area of the room. It can then remove “unwanted” sounds from that signal to make it easier to understand the speech content.

From a control point of view, when a program knows where the speech is coming from (perhaps by using the distance sensor), it can direct the microphone array in that direction, essentially creating a software version of the directional microphones that are physically pointed at actors to record their voices when filming motion pictures.
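One classic way to "direct" a microphone array in software is delay-and-sum beamforming: shift each microphone's signal to undo its arrival delay, then average. Signals from the chosen direction line up and reinforce one another, while sound from elsewhere tends to cancel. The toy sketch below uses integer-sample delays and a single impulse as the "voice"; it illustrates the principle only, not the Kinect's implementation.

```python
def delay_and_sum(signals, delays_samples):
    """Align each microphone's signal by its known arrival delay
    (in samples), then average across microphones."""
    length = len(signals[0])
    out = []
    for n in range(length):
        total = 0.0
        for sig, delay in zip(signals, delays_samples):
            idx = n + delay  # advance the signal to undo its delay
            total += sig[idx] if 0 <= idx < length else 0.0
        out.append(total / len(signals))
    return out


# The same pulse arrives one sample later at each successive mic:
mics = [
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0],
]

# Steering at the true source realigns the pulses so they reinforce
# (full-strength peak); steering elsewhere leaves them smeared out.
print(delay_and_sum(mics, [0, 1, 2]))  # peak of 1.0 at index 1
print(delay_and_sum(mics, [0, 0, 0]))  # no peak above 1/3
```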