
This dissertation presents novel beamforming methods for distant speech recognition (DSR). Such techniques can relieve users of the need to wear close-talking microphones. DSR systems are useful in many applications, such as humanoid robots, voice-control systems for automobiles, and automatic meeting transcription systems. A main problem in DSR is that recognition performance degrades severely when the speaker is far from the microphones. To avoid this degradation, noise and reverberation should be removed from the signals received at the microphones. Acoustic beamforming techniques have the potential to enhance speech from the far field with little distortion, since they can maintain a distortionless constraint for a look direction. In beamforming, multiple signals propagating from a position are captured with multiple microphones. Typical conventional beamformers adjust their weights so as to minimize the variance of their own outputs subject to a distortionless constraint in the look direction. The variance is the average of the second power (square) of the beamformer's outputs; accordingly, the conventional beamformer can be regarded as using second-order statistics (SOS) of its outputs. Conventional beamforming techniques can effectively place a null on any source of interference. However, the desired signal is also canceled in reverberant environments, which is known as the signal cancellation problem. Many algorithms have been developed to avoid this problem.
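As an aside on the second-order-statistics view described above, a minimal sketch of a minimum variance distortionless response (MVDR) beamformer follows, written in Python with NumPy. The array size, steering vector, and signal values are assumptions for illustration only, not details of the methods proposed in this dissertation.

    import numpy as np

    def mvdr_weights(R, d):
        # Minimize the output variance w^H R w subject to the
        # distortionless constraint w^H d = 1 in the look direction.
        # R: (M, M) spatial covariance of the microphone signals (SOS).
        # d: (M,) steering vector for the look direction.
        Rinv_d = np.linalg.solve(R, d)           # R^{-1} d
        return Rinv_d / (d.conj() @ Rinv_d)      # w = R^{-1} d / (d^H R^{-1} d)

    # Illustrative use with assumed values: 4 microphones, T snapshots.
    M, T = 4, 1000
    rng = np.random.default_rng(0)
    X = rng.standard_normal((M, T)) + 1j * rng.standard_normal((M, T))
    R = (X @ X.conj().T) / T                     # sample covariance (SOS)
    d = np.ones(M, dtype=complex)                # steering vector (assumed)
    w = mvdr_weights(R, d)
    y = w.conj() @ X                             # beamformer output

In a reverberant room, reflected copies of the desired signal enter R and are correlated with the direct path, so minimizing the output variance can suppress the target itself; this is precisely the signal cancellation problem noted above.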

Speech processing of lectures recorded inside smart rooms has recently attracted much interest. In particular, the topic has been central to the Rich Transcription (RT) Meeting Recognition Evaluation campaign series, sponsored by NIST, with emphasis placed on benchmarking speech activity detection (SAD), speaker diarization (SPKR), speech-to-text (STT), and speaker-attributed STT (SASTT) technologies. In this paper, we present the IBM systems developed to address these tasks in preparation for the RT 2007 evaluation, focusing on the far-field condition of lecture data collected as part of the European project CHIL. For their development, the systems are benchmarked on a subset of the RT Spring 2006 (RT06s) evaluation test set, where they yield significant improvements over RT06s results for all SAD, SPKR, and STT tasks; for example, a 16% relative reduction in word error rate is reported for STT, attributed to a number of system advances discussed here. Initial results are also presented on SASTT, a task newly introduced in 2007 in place of the discontinued SAD. Index Terms: speech processing, speech recognition, speaker diarization, speech activity detection, lectures, smart rooms.
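As a note on the reporting convention used above, relative and absolute word error rate reductions differ; the following sketch illustrates the relative figure with assumed numbers that are not the actual evaluation scores.

    def relative_reduction(baseline_wer, improved_wer):
        # Fractional (relative) error reduction, e.g. 50.0 -> 42.0 gives 0.16.
        return (baseline_wer - improved_wer) / baseline_wer

    # Assumed, illustrative WERs in percent (not the RT07 results).
    print(relative_reduction(50.0, 42.0))  # 0.16, a 16% relative reduction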

We describe the IBM systems submitted to the NIST RT06s Speech-to-Text (STT) evaluation campaign on the CHIL lecture meeting data for three conditions: multiple distant microphone (MDM), single distant microphone (SDM), and individual headset microphone (IHM). The system building process is similar to that of the IBM conversational telephone speech recognition system. However, the best models for the far-field conditions (SDM and MDM) proved to be the ones that use neither variance normalization nor vocal tract length normalization. Instead, feature-space minimum phone error discriminative training yielded the best results.
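As background on the feature-space minimum phone error (fMPE) training mentioned above: in fMPE-style schemes, each frame receives an additive offset obtained by projecting a high-dimensional vector of Gaussian posteriors. The sketch below shows only the feature transform, with an assumed small GMM; the projection matrix M would be trained under the minimum phone error criterion, which is not reproduced here.

    import numpy as np

    def fmpe_transform(x, means, variances, weights, M):
        # fMPE-style offset: y_t = x_t + M h_t, where h_t holds the
        # Gaussian posteriors of frame x under a diagonal-covariance GMM.
        ll = -0.5 * np.sum((x - means) ** 2 / variances
                           + np.log(2 * np.pi * variances), axis=1)
        ll += np.log(weights)
        h = np.exp(ll - ll.max())
        h /= h.sum()                  # posterior vector h_t
        return x + M @ h              # offset-corrected frame

    # Illustrative use with assumed sizes: D=13 features, G=8 Gaussians.
    rng = np.random.default_rng(1)
    D, G = 13, 8
    x = rng.standard_normal(D)
    means = rng.standard_normal((G, D))
    variances = np.ones((G, D))
    weights = np.full(G, 1.0 / G)
    M = 0.01 * rng.standard_normal((D, G))   # trained discriminatively in fMPE
    y = fmpe_transform(x, means, variances, weights, M)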

Due to the relatively small amount of CHIL-domain data, the acoustic models of our systems are built on publicly available meeting corpora, with maximum a posteriori adaptation applied twice on CHIL data during training: first at the initial speaker-independent model, and subsequently at the minimum phone error model. For language modeling, we utilized meeting transcripts, text from scientific conference proceedings, and spontaneous telephone conversations. On development data, chosen in our work to be the 2005 CHIL-internal STT evaluation test set, the resulting language model provided a 4% absolute gain in word error rate (WER) compared to the model used in last year's CHIL evaluation. Furthermore, the developed STT system significantly outperformed our last year's results, reducing the WER on close-talking microphone data from 36.9% to 25.4% on our development set. In the NIST RT06s evaluation campaign, both the MDM and SDM systems scored well; however, the IHM system did poorly due to unsuccessful cross-talk removal.
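Since the recipe above leans on maximum a posteriori (MAP) adaptation, a minimal sketch of the standard MAP update for Gaussian means follows; the variable names, sizes, and the relevance factor tau are assumptions, and the actual system configuration is not reproduced here.

    import numpy as np

    def map_adapt_means(mu0, gamma, gamma_x, tau=10.0):
        # Standard MAP re-estimation of Gaussian means:
        #   mu_hat = (tau * mu0 + gamma_x) / (tau + gamma)
        # mu0:     (G, D) prior (speaker-independent) means.
        # gamma:   (G,)   per-Gaussian occupancy counts on adaptation data.
        # gamma_x: (G, D) posterior-weighted sums of adaptation frames.
        # With little data the estimate stays near the prior; with much
        # data it approaches the data average gamma_x / gamma.
        return (tau * mu0 + gamma_x) / (tau + gamma)[:, None]

    # Illustrative use with assumed sizes: G=4 Gaussians, D=3 dimensions.
    rng = np.random.default_rng(2)
    G, D = 4, 3
    mu0 = rng.standard_normal((G, D))
    gamma = rng.uniform(1.0, 100.0, size=G)
    gamma_x = gamma[:, None] * (mu0 + 0.1 * rng.standard_normal((G, D)))
    mu_map = map_adapt_means(mu0, gamma, gamma_x)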
