Are deep neural network speech recognizers still hearing-impaired?
Previous comparisons of human speech recognition (HSR) and automatic speech recognition (ASR) focused on monaural signals in additive noise and showed that HSR is far more robust against intrinsic and extrinsic sources of variation than conventional ASR. The difference in performance between normal-hearing (NH) people and ASR was of about the same order of magnitude as the difference between NH and hearing-impaired (HI) listeners (listening experiments were performed during the HearCom project, www.hearcom.eu), leading to the saying that “ASR systems are hearing-impaired”. Recent developments in ASR, especially the use of deep neural networks (DNNs), have shown large improvements in ASR performance compared to standard recognizers based on Gaussian Mixture Models (GMMs). The aim of this study is (A) to compare the recognition performance of NH and HI listeners in monaural conditions with different noise types to that of state-of-the-art ASR systems using DNN/HMM and GMM/HMM architectures and (B) to analyze the man-machine gap (and its causes) in more complex acoustic scenarios, particularly in scenes with two moving speakers and diffuse noise. The overall man-machine gap is measured in terms of the speech recognition threshold (SRT), i.e., the signal-to-noise ratio at which a 50% recognition rate is obtained. For both scenarios, we also investigate the effect of auditory features on the performance of ASR systems and measure the similarity between different ASR systems and NH listeners.
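To make the SRT measure concrete, the following minimal sketch (not the fitting procedure used in this study; all data values and names are hypothetical) fits a logistic psychometric function to recognition rates measured at several SNRs and reads off the SNR at which the rate crosses 50%:

```python
import numpy as np
from scipy.optimize import curve_fit

def psychometric(snr_db, srt, slope):
    # Logistic psychometric function: recognition rate as a function of
    # SNR (dB); by construction, the rate equals 0.5 when snr_db == srt.
    return 1.0 / (1.0 + np.exp(-slope * (snr_db - srt)))

# Hypothetical measurements: SNRs (dB) and observed recognition rates (0..1)
snrs  = np.array([-12.0, -9.0, -6.0, -3.0, 0.0, 3.0])
rates = np.array([0.05, 0.15, 0.40, 0.70, 0.90, 0.97])

# Fit both parameters; 'srt' is the SNR at a 50% recognition rate
(srt, slope), _ = curve_fit(psychometric, snrs, rates, p0=(-5.0, 1.0))
print(f"SRT = {srt:.1f} dB SNR (slope = {slope:.2f} per dB)")
```

The man-machine gap can then be expressed as the difference between the SRT of an ASR system and that of NH listeners on the same speech material.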
For (A), we compare responses of 10 normal-hearing listeners with those of different ASR systems on identical speech material, using the Aurora2 speech recognition framework. In addition, we compare data collected from normal-hearing and hearing-impaired listeners during the HearCom project using the Oldenburg sentence test (OlSa) with additive stationary noise. Results show that state-of-the-art ASR systems can reach the performance of normal-hearing listeners in terms of SRT under certain conditions.
For (B), responses of nine normal-hearing listeners are compared to the errors of an ASR system that employs a binaural model for direction-of-arrival estimation and beamforming for signal enhancement. The comparison shows that the gap amounts to a 16.7 dB SRT difference, which exceeds the 10 dB difference found in monaural situations. Based on cross comparisons that use oracle knowledge (e.g., the speakers’ true positions), incorrect responses are attributed to localization errors (7 dB) or to missing spectral information for distinguishing between speakers of different gender (3 dB). The comparison hence identifies specific ASR components that can profit from learning from binaural auditory signal processing.
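The abstract does not detail the binaural localization model; as an illustrative sketch of one common two-channel direction-of-arrival technique, the GCC-PHAT method below whitens the cross-power spectrum of the left and right ear signals, locates the peak delay, and converts the inter-channel time difference to an azimuth angle (the microphone spacing and all names are assumptions for illustration, not the system evaluated here):

```python
import numpy as np

def gcc_phat_doa(left, right, fs, mic_dist=0.18, c=343.0):
    """Estimate the direction of arrival (azimuth, degrees) of a source
    from a two-channel recording via GCC-PHAT; an illustrative stand-in
    for the binaural model, with mic_dist approximating an inter-ear
    distance in metres."""
    n = len(left) + len(right)
    # Cross-power spectrum, phase-transform (PHAT) weighted to whiten it
    spec = np.fft.rfft(left, n=n) * np.conj(np.fft.rfft(right, n=n))
    cc = np.fft.irfft(spec / (np.abs(spec) + 1e-12), n=n)
    # Restrict the search to physically possible inter-channel delays
    max_lag = int(fs * mic_dist / c)
    cc = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))
    tau = (np.argmax(np.abs(cc)) - max_lag) / fs  # delay in seconds
    # Far-field geometry: tau = mic_dist * sin(theta) / c
    return np.degrees(np.arcsin(np.clip(tau * c / mic_dist, -1.0, 1.0)))
```

Once an azimuth estimate is available, a beamformer steered towards it (e.g., delay-and-sum) can enhance the target speaker before recognition; replacing the estimate with the speakers’ true positions, as in the oracle cross comparisons above, isolates the contribution of localization errors to the man-machine gap.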