Investigating how the brain learns to process speech

Presented at the 19th Conference of the Hellenic Neuroscience Society. Patras, 30 September–2 October 2005

Protopapas, A.
Institute for Language & Speech Processing / "Athena"

Speech is a complex auditory signal carrying linguistic, paralinguistic, and extralinguistic information. The details of the sound depend on fixed and transient characteristics of the speaker, on the transmission properties, and on the content of the intended linguistic message. Classes of sound that correspond to similar articulations are typically considered to constitute “phonetic categories.” The meaning-bearing units of languages are analyzed in terms of phonological categories, which abstract from the articulatory and acoustic phonetic categories on the basis of semantic distinctions. Since languages differ in their vocabulary and in their phonetic repertoire, the brain must learn all of these relations during development. Moreover, learning a second language means that additional structured spaces of sounds and meanings must be formed. How does the brain process the sound of speech when using language to communicate and when learning a new language?

In a series of studies we have examined the temporal envelope of cortical responses and regional brain activation elicited by time-compressed speech signals. Speech can be understood when sped up substantially, but only up to a point. Using MEG recordings, we observed that speech comprehension is related to the entrainment (signal-following response) of the cortex to the syllable-level modulation of the acoustic signal. Using fMRI, we found that left-hemisphere areas often associated with linguistic functions exhibited a curvilinear response to compression, indicating that overall activity matches the demands of the increased signal rate as long as comprehension remains possible. Far from supporting a static view of the brain as a passive processor of incoming information, these findings point to a brain that seeks out and adapts to signals of importance.
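To make the entrainment measure concrete, the following sketch (in Python with NumPy and SciPy) extracts the slow, syllable-rate amplitude envelope of a speech signal and computes its peak lagged correlation with a simultaneously recorded cortical signal. It is only an illustration of the general idea, not the published analysis pipeline: the sampling rates, cutoff frequency, lag window, and toy data below are all assumptions chosen for demonstration.

```python
# Minimal sketch: quantify "entrainment" as the lagged correlation between the
# slow amplitude envelope of a speech signal and a cortical signal (e.g., one
# MEG channel). All parameter values here are illustrative assumptions.
import numpy as np
from scipy.signal import hilbert, butter, filtfilt, resample

def syllable_envelope(audio, fs_audio, fs_out=200.0, cutoff_hz=10.0):
    """Slow amplitude envelope of speech, capturing syllable-rate modulation."""
    env = np.abs(hilbert(audio))                      # instantaneous amplitude
    b, a = butter(4, cutoff_hz, btype="low", fs=fs_audio)
    env = filtfilt(b, a, env)                         # keep modulations below ~10 Hz
    n_out = int(round(len(env) * fs_out / fs_audio))
    return resample(env, n_out)                       # downsample to MEG-like rate

def entrainment_score(envelope, cortical, fs, max_lag_s=0.3):
    """Peak correlation between envelope and cortical signal over small lags."""
    n = min(len(envelope), len(cortical))
    env, ctx = envelope[:n], cortical[:n]
    best = -1.0
    for lag in range(int(max_lag_s * fs) + 1):        # cortex lags the stimulus
        r = np.corrcoef(env[: n - lag] if lag else env, ctx[lag:n])[0, 1]
        best = max(best, r)
    return best

# Toy usage: a 4 Hz "syllabic" modulation carried by noise, and a noisy,
# delayed cortical copy of its envelope.
fs_audio, fs_meg, dur = 16000, 200.0, 5.0
t = np.arange(int(fs_audio * dur)) / fs_audio
audio = (1 + np.sin(2 * np.pi * 4 * t)) * np.random.randn(len(t))
env = syllable_envelope(audio, fs_audio, fs_out=fs_meg)
cortical = np.roll(env, int(0.1 * fs_meg)) + 0.5 * np.random.randn(len(env))
print("entrainment score:", round(entrainment_score(env, cortical, fs_meg), 2))
```

In this toy example the cortical trace is simply a delayed, noisy copy of the stimulus envelope, so the score peaks near the imposed 100 ms lag; with real recordings the same logic would be applied per condition of time compression.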

In agreement with production models, the level of syllabic organization appears to be central in perception, despite the linguistic focus on individual segments. Because the acoustic properties of the actual sounds that convey the purported segments vary greatly with syllabic position, a focus on syllables may make it easier for perception models to account for communicative behavior. Still, the variability issues are far from solved, since syllables also sound quite different depending on who utters them, under what conditions, and in what manner. Experimental data have made it clear that listeners exposed to speech stimuli store much of the nonlinguistic auditory information present in the signal. For example, listening to words spoken by a set of talkers results not only in later identification of these talkers’ voices, but also in easier recognition of the same words, and of new words uttered by the same talkers, compared with the same words spoken by unknown voices.

The view of speech processing in which rich information about the signal is retained stands in contrast to older theories of phonetic perception, in which a “normalization” process was assumed to strip the nonlinguistic features off the signal. It is, however, consonant with viewpoint-dependent object recognition and exemplar-based category models, which reflect progress in the fields of vision and categorization over the last decades.
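The exemplar-based alternative can be illustrated with a minimal categorizer in the spirit of the generalized context model family: every heard token is stored with its full acoustic detail, including talker-specific dimensions, and new tokens are classified by summed similarity to the stored exemplars of each category. The feature dimensions, category labels, and sensitivity parameter below are assumptions made purely for illustration, not a model fitted to the data discussed here.

```python
# Minimal exemplar-based categorizer: no normalization strips talker detail;
# classification relies on similarity to all stored tokens of each category.
import numpy as np

class ExemplarCategorizer:
    def __init__(self, sensitivity=2.0):
        self.sensitivity = sensitivity      # steepness of the similarity gradient
        self.exemplars = []                 # list of (feature_vector, label)

    def store(self, features, label):
        """Store a token verbatim, with all of its acoustic detail."""
        self.exemplars.append((np.asarray(features, float), label))

    def classify(self, features):
        """Return P(label) from summed exponential similarity to each category."""
        x = np.asarray(features, float)
        sums = {}
        for ex, label in self.exemplars:
            sim = np.exp(-self.sensitivity * np.linalg.norm(x - ex))
            sums[label] = sums.get(label, 0.0) + sim
        total = sum(sums.values())
        return {label: s / total for label, s in sums.items()}

# Toy usage with made-up 2-D "acoustic" features:
model = ExemplarCategorizer()
for tok in [(0.2, 0.8), (0.3, 0.7), (0.25, 0.9)]:
    model.store(tok, "/r/")
for tok in [(0.8, 0.2), (0.7, 0.3), (0.9, 0.25)]:
    model.store(tok, "/l/")
print(model.classify((0.35, 0.75)))         # closest to the stored /r/ exemplars
```

Because stored exemplars carry talker-specific detail, a model of this kind naturally predicts the familiar-voice advantages described above, and it predicts better generalization when the stored set spans more of the relevant variability.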

To examine the role of acoustic variability in speech processing when learning new sounds, we trained Japanese listeners to distinguish between the English sounds /r/ and /l/, a distinction normally very difficult for them. We found that training must include variability along all relevant dimensions, otherwise learning does not generalize. For example, learning to distinguish /r/ from /l/ at word beginnings does not transfer to the distinction in consonant clusters or at word endings. Because others had previously found poor generalization to untrained voices, we used a set of training voices, which resulted in successful generalization to new ones. These observations from learning strengthen the exemplar-based view of phonetic categorization and necessitate new models of linguistic representation at the level of phonology and below.

One crucial aspect of speech learning concerns the brain systems that are responsible for forming the new representations. Traditional training studies have used explicit training with feedback, in which participants hear a sound and must indicate which type it is. This task is highly unnatural, which may contribute to the poor transfer of the learned skill to the context of language use. Recently it has been shown that implicit learning of nonspeech acoustic categories is possible, and that unattended (even subliminal) features of visual stimuli can be learned as long as they correlate with an attended task. We hypothesize that the unattended features are associated with the attended task because they act as predictors in a model akin to classical conditioning, in which reward contingencies drive behavioral learning. In such a model, internally generated neuromodulatory signals enhance processing of contextual features. On the other hand, our previous training studies showed that performance feedback is not necessary for learning under focused attention, so attention itself can apparently provide alternative learning signals, in agreement with more mainstream attention-based models of learning and categorization. This differentiation is consistent with brain stimulation studies in animal models and with our emerging understanding of the conditions in which different neuromodulator systems become active.
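The conditioning-style account can be sketched as a reward-gated Hebbian simulation: a diffuse, internally generated signal (here, reward minus its running average, in the spirit of a phasic neuromodulatory burst) gates strengthening of whatever features are active, attended or not. The feature layout, learning rate, and decay term are illustrative assumptions, not a fitted model of the experiments cited.

```python
# Minimal sketch of reward-contingent learning of unattended features.
# Features that correlate with the rewarded (attended) task gain weight;
# uncorrelated features do not. All parameters are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
alpha, decay, n_trials = 0.05, 0.01, 2000
w = np.zeros(3)        # weights: [attended, correlated-unattended, uncorrelated]
baseline = 0.0         # running estimate of average reward

for _ in range(n_trials):
    attended = rng.integers(0, 2)                     # task-relevant feature (0/1)
    # An unattended feature co-occurs with the attended one on 80% of trials.
    correlated = attended if rng.random() < 0.8 else 1 - attended
    uncorrelated = rng.integers(0, 2)                 # task-irrelevant, random
    x = np.array([attended, correlated, uncorrelated], float)
    reward = float(attended)                          # the attended task is rewarded
    modulator = reward - baseline                     # phasic "surprise" signal
    w += alpha * modulator * x - decay * w            # reward-gated Hebbian update
    baseline += 0.05 * (reward - baseline)            # slowly track mean reward

print("weights [attended, correlated, uncorrelated]:", np.round(w, 2))
# Expected pattern: attended > correlated-unattended >> uncorrelated (near zero),
# i.e., an unattended feature is learned only to the extent that it predicts reward.
```

The contrast with feedback-free learning under focused attention would correspond, in this toy framing, to replacing the reward-driven modulator with an attention-driven gating signal; which of the two regimes operates presumably depends on which neuromodulator systems are engaged.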

In studies currently underway we are investigating the formation of phonetic categories under different conditions of relevance, in order to uncover the contributions of potentially different learning systems and their implications for generalizing learned categories to language use. We are also examining the role of internally generated maps in nonspeech category learning, aiming to understand the relation between articulation and perception and the extent to which this relation is unique to speech, perhaps as an evolutionary adaptation, or is merely an effect of the regular mapping between the two domains that develops in infancy.

References
Ahissar M, Hochstein S. Trends Cogn Sci 8: 457–464, 2004.
Ahissar E, Nagarajan S, Ahissar M, Protopapas A, Merzenich MM. Proc Natl Acad Sci 98: 13367–13372, 2001.
Bao S, Chan VT, Merzenich MM. Nature 412: 79–83, 2001.
Diehl R, Lotto A, Holt LL. Annu Rev Psychol 55: 149–179, 2004.
Kilgard MP, Merzenich MM. Science 279: 1714–1718, 1998.
Kruschke JK. In: Lamberts K, Goldstone RL (Eds.), The Handbook of Cognition, Ch. 7, pp. 183–201. Sage, 2005.
Lively S, Logan J, Pisoni D. J Acoust Soc Am 94: 1242–1255, 1993.
McCandliss BD, Fiez JA, Protopapas A, Conway M, McClelland JL. Cogn Affect Behav Neurosci 2: 89–108, 2002.
Palmeri TJ, Gauthier I. Nat Rev Neurosci 5: 291–304, 2004.
Palmeri TJ, Goldinger SD, Pisoni DB. J Exp Psychol Learn Mem Cogn 19: 309–328, 1993.
Poldrack RA, Temple E, Protopapas A, Nagarajan S, Tallal P, Merzenich MM, Gabrieli JDE. J Cogn Neurosci 13: 687–697, 2001.
Seitz AR, Watanabe T. Trends Cogn Sci (in press).
Seitz AR, Watanabe T. Nature 422: 36, 2003.
Wade T, Holt LL. J Acoust Soc Am (in press).
Watanabe T, Náñez JE, Sasaki Y. Nature 413: 844–848, 2001.