Cortical networks for recognition of speech with simultaneous talkers

Hear Res. 2023 Sep 15:437:108856. doi: 10.1016/j.heares.2023.108856. Epub 2023 Jul 24.

Abstract

The relative contributions of superior temporal vs. inferior frontal and parietal networks to recognition of speech in a background of competing speech remain unclear, although the contributions themselves are well established. Here, we use fMRI with spectrotemporal modulation transfer function (ST-MTF) modeling to examine the speech information represented in temporal vs. frontoparietal networks for two speech recognition tasks with and without a competing talker. Specifically, 31 listeners completed two versions of a three-alternative forced choice competing speech task: "Unison" and "Competing", in which a female (target) and a male (competing) talker uttered identical or different phrases, respectively. Spectrotemporal modulation filtering (i.e., acoustic distortion) was applied to the two-talker mixtures and ST-MTF models were generated to predict brain activation from differences in spectrotemporal-modulation distortion on each trial. Three cortical networks were identified based on differential patterns of ST-MTF predictions and the resultant ST-MTF weights across conditions (Unison, Competing): a bilateral superior temporal (S-T) network, a frontoparietal (F-P) network, and a network distributed across cortical midline regions and the angular gyrus (M-AG). The S-T network and the M-AG network responded primarily to spectrotemporal cues associated with speech intelligibility, regardless of condition, but the S-T network responded to a greater range of temporal modulations suggesting a more acoustically driven response. The F-P network responded to the absence of intelligibility-related cues in both conditions, but also to the absence (presence) of target-talker (competing-talker) vocal pitch in the Competing condition, suggesting a generalized response to signal degradation. Task performance was best predicted by activation in the S-T and F-P networks, but in opposite directions (S-T: more activation = better performance; F-P: vice versa). Moreover, S-T network predictions were entirely ST-MTF mediated while F-P network predictions were ST-MTF mediated only in the Unison condition, suggesting an influence from non-acoustic sources (e.g., informational masking) in the Competing condition. Activation in the M-AG network was weakly positively correlated with performance and this relation was entirely superseded by those in the S-T and F-P networks. Regarding contributions to speech recognition, we conclude: (a) superior temporal regions play a bottom-up, perceptual role that is not qualitatively dependent on the presence of competing speech; (b) frontoparietal regions play a top-down role that is modulated by competing speech and scales with listening effort; and (c) performance ultimately relies on dynamic interactions between these networks, with ancillary contributions from networks not involved in speech processing per se (e.g., the M-AG network).

Keywords: Auditory; Competing speech; Sensorimotor; Spectrotemporal modulations; Speech recognition; fMRI.

Publication types

  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Acoustics
  • Cognition
  • Cues
  • Female
  • Humans
  • Male
  • Perceptual Masking / physiology
  • Speech Intelligibility
  • Speech Perception* / physiology
  • Speech*