Several emerging applications operate on human speech; examples include smart homes, automated camera steering, and surveillance systems. All of these require knowledge of the speaker's position relative to a microphone array, which is rarely available in practice, so the position must be estimated. This estimation is difficult in practice due to phenomena such as reverberation, background noise, microphone array calibration errors, and interfering sources. These phenomena appear in almost every practical scenario and complicate, or even preclude, estimating the speaker's location. This project therefore tackles the estimation problem in a novel way: visual information about the speaker, obtained with one or more cameras, is used jointly with microphone recordings for speaker localization. This approach is beneficial because the audio and visual information are complementary; many of the aforementioned phenomena do not affect camera recordings. The robust estimates obtained in this way will help improve the performance of the applications listed above.
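To make the audio side of the estimation problem concrete, a common purely audio-based building block is time-difference-of-arrival (TDOA) estimation between a pair of microphones, e.g. via the generalized cross-correlation with phase transform (GCC-PHAT). The sketch below is a minimal illustration using NumPy, not the project's actual method; the function name `gcc_phat` and the simulated signals are assumptions for demonstration only.

```python
import numpy as np

def gcc_phat(sig, ref, fs):
    """Estimate the delay (in seconds) of `sig` relative to `ref`
    using GCC-PHAT: whiten the cross-spectrum, then pick the
    cross-correlation peak."""
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    # Cross-spectrum with PHAT weighting (magnitude normalized away)
    R = SIG * np.conj(REF)
    R /= np.abs(R) + 1e-15
    cc = np.fft.irfft(R, n=n)
    # Re-center so index 0 corresponds to the most negative lag
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs

# Simulated example: microphone 2 receives the same white-noise
# source delayed by 5 samples relative to microphone 1.
fs = 16000
rng = np.random.default_rng(0)
s = rng.standard_normal(4096)
delay = 5
x1 = s
x2 = np.concatenate((np.zeros(delay), s[:-delay]))
tau = gcc_phat(x2, x1, fs)
print(round(tau * fs))  # estimated delay in samples
```

In an anechoic, noise-free simulation like this the correlation peak sits exactly at the true lag; the reverberation and interference mentioned above smear or shift this peak in real recordings, which is precisely what motivates fusing the audio estimate with visual information.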