Visible speech enhanced: What do gestures and lip movements contribute to degraded speech comprehension?
Face-to-face communication involves the audiovisual binding of information from multiple inputs, such as speech, lip movements, and iconic gestures. Previous research has shown that these kinds of visual input can enhance speech comprehension, especially in adverse listening conditions. However, the contributions of lip movements and of iconic gestures to understanding speech in noise have mostly been studied separately. The current study aimed to investigate the contribution of both types of visual information to degraded speech comprehension in a joint context.
In Experiment 1, we investigated the contribution of iconic gestures and lip movements to degraded speech comprehension in four auditory conditions (clear speech, 16-band, 10-band, and 6-band noise-vocoding) to determine the noise-vocoding level at which these visual inputs enhance degraded speech comprehension the most. Participants were presented with video clips (speech/lips or speech/lips/gesture) of an actress uttering a Dutch action verb, followed by a cued-recall task. This cued-recall task included the target verb, a semantic competitor, a phonological competitor, and an unrelated distractor. Across all noise-vocoding levels, visual input significantly enhanced degraded speech comprehension. This enhancement was largest at 6-band noise-vocoding, as indicated by the largest difference in response accuracy between speech/lips/gesture and speech/lips trials. In addition, the error analyses revealed that information from lip movements was used for phonological disambiguation, whereas gestural information was used for semantic disambiguation.
In Experiment 2, we investigated the individual contributions of lip movements and iconic gestures to this audiovisual enhancement. Participants watched videos in three speech conditions (2-band noise-vocoding, 6-band noise-vocoding, clear speech), three visual conditions (speech/lips blurred, speech/lips visible, speech/lips/gesture), and two non-audio conditions (lips only, lips/gesture) to assess how much information they could extract from the visual input alone. Response accuracy was significantly higher in the speech/lips/gesture condition than in the speech/lips and speech/lips blurred conditions across all noise-vocoding levels, and higher for lips/gesture videos than for lips-only videos. Additionally, the difference between speech/lips/gesture and speech/lips was significantly larger at 6-band than at 2-band noise-vocoding, and larger than the difference between the two non-audio conditions. However, at the 2-band noise-vocoding level, the speech/lips/gesture and speech/lips conditions did not differ from the two non-audio conditions.
Our results indicate that when degraded speech is processed in a visual context, listeners benefit significantly more from gestural information than from lip movements alone, especially when auditory cues are still moderately reliable (6-band noise-vocoding) rather than no longer reliable (2-band noise-vocoding).