How do you design a computer vision algorithm that is able to detect and segment people when they are captured by a visible light camera, a thermal infrared camera, and a depth sensor? And how do you fuse the three inherently different data streams such that you can reliably transfer features from one modality to another? Feel free to download our dataset and try it out yourselves!

The dataset features a total of 5724 annotated frames divided among three indoor scenes.
Activity in scenes 1 and 3 uses the full depth range of the Kinect for XBOX 360 sensor, whereas activity in scene 2 is constrained to a depth range of ±0.250 m in order to suppress the parallax between the two physical sensors. Scenes 1 and 2 are situated in a closed meeting room with little natural light to disturb the depth sensing, whereas scene 3 is situated in an area with wide windows and a substantial amount of sunlight. In each scene, three persons are interacting, reading, walking, sitting, etc.

Every person in a scene is annotated with a unique ID at the pixel level in the RGB modality. For the thermal and depth modalities, annotations are transferred from the RGB images using the registration algorithm found in registrator.cpp.
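
To illustrate what a per-person, pixel-level annotation makes possible, here is a minimal sketch in plain Python that splits one annotation mask into a binary mask per person. It assumes a mask is a 2D grid of integers where each pixel holds a person ID and 0 marks the background. That encoding, the function name split_person_masks, and the toy data are illustrative assumptions, not the dataset's documented file format.

```python
def split_person_masks(id_mask, background=0):
    """Split a 2D grid of person IDs into one binary mask per person.

    Assumption: each cell holds an integer person ID and `background`
    marks unannotated pixels. The dataset's real encoding may differ.
    """
    height = len(id_mask)
    width = len(id_mask[0]) if height else 0
    masks = {}
    for y, row in enumerate(id_mask):
        for x, person_id in enumerate(row):
            if person_id == background:
                continue
            # Create an all-zero mask the first time an ID is seen.
            if person_id not in masks:
                masks[person_id] = [[0] * width for _ in range(height)]
            masks[person_id][y][x] = 1
    return masks


# Toy 3x4 "frame" with two annotated people (hypothetical IDs 1 and 2).
toy_mask = [
    [0, 1, 1, 0],
    [0, 1, 2, 2],
    [0, 0, 2, 0],
]

person_masks = split_person_masks(toy_mask)
# Number of annotated pixels per person.
pixel_counts = {pid: sum(map(sum, m)) for pid, m in person_masks.items()}
print(pixel_counts)
```

For real frames, the same idea would normally be applied to an image array loaded with an imaging library rather than to nested lists.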

We have used our AAU VAP Multimodal Pixel Annotator to create the ground-truth, pixel-based masks for all three modalities.
Date made available: 1 Jan 2017
Date of data production: 2013
