For example, if location information is extracted and made available to the C-matrices in a manner analogous to the pitch-grams, then it can be exploited in parallel with, and exactly as, the pitch. Temporal coherence can similarly help segregate speech using co-modulated signals from other modalities, as in lip-reading, demonstrated later. B The complexes are segregated using all spectral and pitch channels. Closely spaced harmonics mutually interact, and hence their channels are only partially correlated with those of the remaining harmonics, becoming weak or even vanishing in the segregated streams.
Speech mixtures share many of the same characteristics already seen in the examples of Fig. For instance, they contain harmonic complexes with different pitches. Speech also possesses other features, such as broad bursts of noise immediately followed or preceded by voiced segments (as in various consonant-vowel combinations), or even accompanied by voicing (voiced consonants and fricatives). In all these cases, the syllabic onsets of one speaker synchronize a host of channels driven by the harmonics of the voicing, channels that are desynchronized or uncorrelated with those driven by the other speaker.
The pitch tracks associated with each of these panels are shown below them. A Mixture of two sample utterances (left panel) spoken by a female (middle panel) and a male (right panel); pitch tracks of the utterances are shown below each panel. B The segregated speech using all C-matrix columns.
C The segregated speech using only coincidences among the frequency-scale channels (no pitch information). D The segregated speech using the channels surrounding the pitch channels of the female speaker as the anchor. The reconstructed spectrograms are not identical to the originals of Fig. Nevertheless, with two speakers there are sufficient gaps between the syllables of each speaker to provide clean, unmasked views of the other speaker's signal.
If more speakers are added to the mix, such gaps become sparser and the amount of energetic masking increases; this is why it is harder to segregate one speaker in a crowd unless that speaker is distinguished by unique features or a louder signal. An interesting aspect of speech is that the relative amplitudes of its harmonics vary widely over time, reflecting the changing formants of different phonemes.
Consequently, the saliency of the harmonic components changes continually, with weaker ones dropping out of the mixture as they become completely masked by the stronger components. Despite these changes, speech syllables of one speaker maintain a stable representation of a sufficient number of features from one time instant to the next, and thus can maintain the continuity of their stream.
This is especially true of the pitch, which changes only slowly and relatively little during normal speech. The same is true of the spectral region of maximum energy, which reflects the average formant locations of a given speaker and hence, partially, the timbre and length of their vocal tract. Humans utilize either of these cues alone or in conjunction with additional cues to segregate mixtures. For instance, to segregate speech with overlapping pitch ranges (a mixture of male speakers), one may rely on the different spectral envelopes (timbres), or on other potentially different features such as location or loudness.
Humans can also exploit more complex factors, such as higher-level linguistic knowledge and memory, as we discuss later. In the example of Fig. , the extracted speech streams of the two speakers resemble the original unmixed signals, and their reconstructions exhibit significantly less mutual interference than the mixture, as quantified later.
Exactly the same logic can be applied to any auxiliary function that is co-modulated in the same manner as the rest of the speech signal.
These two functions (inter-lip distance and the acoustic envelope) can then be exploited to segregate the target speech much as with the pitch channels earlier. Thus, segregation can proceed by simply computing the correlation between the lip function (Fig. ) and the remaining cortical channels. This example thus illustrates how, in general, any other co-modulated features of the speech signal can serve the same role.
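As a sketch of this idea, one can correlate an external co-modulated envelope (such as a lip-aperture signal) with every channel envelope and keep the channels that follow it. The function name and threshold below are illustrative assumptions, not part of the model; the model itself performs this correlation within the C-matrix machinery in the cortical domain.

```python
import numpy as np

def comodulated_channels(anchor, channels, thresh=0.5):
    """Group channels with an external co-modulated signal (e.g. inter-lip
    distance): channels whose Pearson correlation with the anchor exceeds
    the threshold are assigned to the target speaker's stream.

    anchor   : (T,) auxiliary signal sampled at the channel frame rate
    channels : (T, N) cortical channel envelopes
    """
    a = anchor - anchor.mean()
    X = channels - channels.mean(axis=0)
    num = a @ X                                        # covariance with each channel
    den = np.linalg.norm(a) * np.linalg.norm(X, axis=0) + 1e-12
    r = num / den                                      # Pearson correlation per channel
    return r >= thresh
```

Here a channel that rises and falls with the lip signal is tagged as belonging to the attended speaker, while anti-phase or unrelated channels are left to the background.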
The performance of the model is quantified with a database of mixtures formed from pairs of male-female speech randomly sampled from the TIMIT database (Fig. ). The signal-to-noise ratio is computed as in Eqs (1) and (2). B Top Notation used for the coincidence measures computed between the original and segregated sentences, plotted in the panels below.
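As a concrete reference point, a conventional waveform-domain SNR can be written as follows. This is an illustrative stand-in, not the paper's metric, which is computed on the cortical representations; the function name is hypothetical.

```python
import numpy as np

def segregation_snr_db(original, estimate):
    """Conventional SNR (dB) between an original source and its segregated
    estimate: signal energy over residual (original minus estimate) energy.
    """
    original = np.asarray(original, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    noise = original - estimate                 # residual interference + distortion
    return 10 * np.log10(np.sum(original ** 2) / np.sum(noise ** 2))
```

A perfect reconstruction drives the residual energy to zero and the SNR to infinity; a reconstruction at 90% of the original amplitude gives 20 dB.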
Middle Distribution of coincidence in the cortical domain between each segregated speech sample and its corresponding original version (violet) and the original interferer (magenta). Bottom Scatter plot of the difference between the correlations of the original sentences with each segregated sentence, demonstrating that the two segregated sentences correlate well with different original sentences. Another way to demonstrate the effectiveness of the segregation is to compare the match between the segregated samples and their corresponding originals. This is evidenced by the minimal overlap in Fig.
To compare these coincidences directly for each pair of mixed sentences, the difference between the coincidences in each mixture is scatter-plotted in the bottom panel, demonstrating effective pairwise segregation.
Examples of segregated and reconstructed audio files can be found in S1 Dataset. In principle, segregating mixtures does not depend on them being speech or music, but rather on the signals having different spectrotemporal patterns and exhibiting a continuity of features. The same speech sample is extracted from a mixture with music in Fig.
A Speech mixed with street noise of many overlapping spectral peaks (left panel). The two signals are uncorrelated and hence can be readily segregated and the speech reconstructed (right panel). B Extraction of speech (right panel) from a mixture of speech and a sustained oboe melody (left panel). So far, attention and memory have played no direct role in the segregation, but adding them is relatively straightforward.
From a computational point of view, attention can be interpreted as a focus directed to one or a few features or feature subspaces of the cortical model, enhancing their amplitudes relative to the unattended features. For instance, in segregating speech mixtures, one might choose to attend specifically to the high female pitch in a group of male speakers (Fig. ). In these cases, only the appropriate subset of columns of the C-matrices is needed to compute the nPCA decomposition (Fig. ).
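A minimal sketch of this column-selection view of attention follows; the function name and the grouping threshold are illustrative assumptions, not parameters of the model.

```python
import numpy as np

def attended_stream_mask(C, anchor_idx, thresh=0.5):
    """Attention as column selection on a coincidence matrix C: keep only
    the columns of the attended anchor channels (e.g. a speaker's pitch
    channels), then group every channel whose coincidence with the anchors
    is strong into the attended foreground stream.

    C          : (N, N) coincidence matrix
    anchor_idx : indices of the attended anchor channels
    """
    cols = np.abs(C[:, np.asarray(anchor_idx)])   # attended C-matrix columns
    score = cols.mean(axis=1)                     # mean coincidence with the anchors
    return score / score.max() >= thresh          # foreground-channel mask
```

Channels outside the mask form the background, against which the attended stream is segregated.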
This is in fact also the interpretation of the simulations discussed in Fig. In all these cases, the segregation exploited only the C-matrix columns marking coincidences of the attended anchor channels (pitch, lip, loudness) with the remaining channels. Memory can also be strongly implicated in stream segregation, in that it constitutes priors about the sources that can be effectively utilized to process the C-matrices and perform the segregation.
For example, in extracting the melody of the violins in a large orchestra, it is necessary to know first what the timbre of a violin is before one can turn the attentional focus to its unique spectral-shape features and pitch range. A biologically plausible model of auditory cortical processing can be used to implement the perceptual organization of auditory scenes into distinct auditory objects (streams). Two key ingredients are essential: (1) a multidimensional cortical representation of sound that explicitly encodes various acoustic features along which streaming can be induced; (2) clustering of the temporally coherent features into different streams.
Temporal coherence is quantified by the coincidence between all pairs of cortical channels, slowly integrated at cortical time-scales as described in Fig. An auto-encoder network mimicking Hebbian synaptic rules implements the clustering through nonlinear PCA to segregate the sound mixture into a foreground and a background.
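As an illustration of how a Hebbian rule can implement such a decomposition, Oja's rule (a classical Hebbian update with a weight-decay term) converges to the leading principal component of the channel responses. The model's auto-encoder is a nonlinear generalization of this linear sketch; the function name and learning parameters below are illustrative assumptions.

```python
import numpy as np

def oja_principal_component(X, lr=0.01, epochs=50, seed=0):
    """Hebbian (Oja's rule) estimate of the leading principal component of
    the channel responses. Channels loading strongly (same sign) on the
    learned weight vector co-vary and belong to one stream; weakly loaded
    channels belong to the competing background.

    X : (T, N) matrix of channel responses over time
    """
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(X.shape[1])
    w /= np.linalg.norm(w)
    for _ in range(epochs):
        for x in X:
            y = w @ x                        # projection onto the output unit
            w += lr * y * (x - y * w)        # Hebbian growth with Oja's decay
    return w / np.linalg.norm(w)
```

With two channels carrying the same source and a third carrying weak independent noise, the learned vector loads on the two coherent channels and nearly ignores the third.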
The temporal coherence model segregates novel sounds based exclusively on the ongoing temporal coherence of their perceptual attributes. Previous efforts at exploiting, explicitly or implicitly, the correlations among stimulus features differed fundamentally in the details of their implementation. For example, some algorithms attempted to decompose directly the channels of the spectrogram representations rather than the more distributed multi-scale cortical representations.
They either used the fast phase-locked responses available in the early auditory system, or relied exclusively on the pitch-rate responses induced by interactions among the unresolved harmonics of a voiced sound. The cortical model instead naturally exploits multi-scale dynamics and spectral analyses to define the structure of all these computations as well as their parameters.
For instance, the products of the wavelet coefficients (the entries of the C-matrices) naturally compute the running coincidence between channel pairs, integrated over a time interval determined by the time constants of the cortical rate-filters (Fig. ). This ensures that all coincidences are integrated over time intervals that are commensurate with the dynamics of the underlying signals, and that a balanced range of these windows is included to process slowly varying (2 Hz) up to rapidly changing (16 Hz) features.
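A simplified sketch of such a running-coincidence computation follows. First-order leaky integrators stand in for the model's rate-filters, and the function name and frame rate are illustrative assumptions; the intent is only to show products of channel outputs being integrated over windows matched to each rate.

```python
import numpy as np

def coincidence_matrices(channels, fs=100.0, rates=(2.0, 4.0, 8.0, 16.0)):
    """Running coincidence between all channel pairs, one leaky-integrated
    C-matrix per cortical rate (faster rates forget faster).

    channels : (T, N) array of cortical channel outputs at frame rate fs (Hz)
    rates    : rate-filter centre frequencies in Hz
    """
    T, N = channels.shape
    C = {}
    for r in rates:
        alpha = np.exp(-2 * np.pi * r / fs)          # per-frame forgetting factor
        Ct = np.zeros((N, N), dtype=complex)
        for t in range(T):
            x = channels[t]
            # instantaneous pairwise coincidence = outer product of channels,
            # integrated over a window commensurate with this rate
            Ct = alpha * Ct + (1 - alpha) * np.outer(x, np.conj(x))
        C[r] = Ct
    return C
```

Two co-modulated channels accumulate a large coincidence entry, while an anti-phase channel pair integrates to a value near zero.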
The biological plausibility of this model rests on physiological and anatomical support for its two postulates: a cortical multidimensional representation of sound, and coherence-dependent computations. The cortical representation is the end-result of a sequence of transformations in the early and central auditory system, with experimental support discussed in detail elsewhere.
The version used here incorporates only a frequency (tonotopic) axis, spectrotemporal analysis (scales and rates), and pitch analysis. However, other features that are pre-cortically extracted can be readily added as inputs to the model, such as spatial location (from interaural differences and elevation cues) and the pitch of unresolved harmonics.
The second postulate concerns the crucial role of temporal coherence in streaming. It is a relatively recent hypothesis and hence direct tests remain scant. Nevertheless, targeted psychoacoustic studies have already provided perceptual support for the idea that coherence of stimulus features is necessary for the perception of streams. Parallel physiological experiments have also demonstrated that coherence is a critical ingredient in streaming, and have provided indirect evidence of its mechanisms through rapidly adapting cooperative and competitive interactions between coherent and incoherent responses.
Nevertheless, much more remains uncertain. For instance, where are these computations performed? How exactly are the auto-encoder clustering analyses implemented? And what exactly is the role of attentive listening versus pre-attentive processing in facilitating the various computations? All these uncertainties, however, invoke coincidence-based computations and adaptive mechanisms that have been widely studied or postulated, such as coincidence detection and Hebbian associations.
Dimensionality reduction of the coincidence matrix through nonlinear PCA effectively allows us to cluster all correlated channels apart from the others, thus grouping and designating them as belonging to distinct sources. This view bears a close relationship to the predictive clustering-based algorithm in which input feature vectors are gradually clustered or routed into distinct streams.
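A minimal numerical illustration of this point: the leading eigenvector of a coincidence matrix directly picks out the mutually coherent channel group, which amounts to a clustering of the channels. The linear eigendecomposition, function name, and threshold are illustrative simplifications of the model's nonlinear PCA.

```python
import numpy as np

def foreground_channels(C, thresh=0.25):
    """Channels loading strongly on the leading eigenvector of a symmetric
    coincidence matrix C form the dominant (foreground) cluster.
    """
    vals, vecs = np.linalg.eigh(C)            # eigh: ascending eigenvalues
    v = np.abs(vecs[:, -1])                   # leading eigenvector, per-channel load
    return v / v.max() >= thresh              # boolean foreground mask
```

For a block-structured coincidence matrix (two mutually coherent channel pairs), the mask recovers the more strongly coherent pair, i.e. dimensionality reduction and clustering coincide.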
In both the coherence and clustering algorithms, cortical dynamics play a crucial role in integrating incoming data into the appropriate streams, and the two are therefore expected to exhibit largely similar results. In some sense, the distinction between the two approaches is one of implementation rather than of fundamental concepts. Clustering patterns and reducing their features are often (but not always) two sides of the same coin, and can be shown under certain conditions to be largely equivalent and to yield similar clusters.
Nevertheless, from a biological perspective, it is important to adopt the correlation view as it suggests concrete mechanisms to explore. Our emphasis thus far has been on demonstrating the ability of the model to perform unsupervised automatic source segregation, much like a listener that has no specific objectives. In reality, of course, humans and animals utilize intentions and attention to selectively segregate one source as the foreground against the remaining background.
This operational mode would similarly apply in applications in which the user of a technology identifies a target voice to enhance and isolate from among several, based on its pitch, timbre, location, or other attributes. The temporal coherence algorithm can be readily and gracefully adapted to incorporate such information and task objectives, as when specific subsets of the C-matrix columns are used to segregate a targeted stream. In fact, our experience with the model suggests that segregation is usually of better quality and faster to compute with attentional priors.
In summary, we have described a model for segregating complex sound mixtures based on the temporal coherence principle. The model computes the coincidence of multi-scale cortical features and clusters the coherent responses as emanating from one source. It requires no prior information, statistics, or knowledge of source properties, but can gracefully incorporate them along with cognitive influences such as attention to, or memory of specific attributes of a target source to segregate it from its background.
The model provides a testable framework of the physiological bases and psychophysical manifestations of this remarkable ability. Finally, the relevance of these ideas transcends the auditory modality, and may help elucidate the robust visual perception of cluttered scenes. Sound is first transformed into its auditory spectrogram, followed by a cortical spectrotemporal analysis of the modulations of the spectrogram (Fig. ).
Pitch is an additional perceptual attribute that is derived from the resolved low-order harmonics and used in the model. Other perceptual attributes, such as location and unresolved harmonic pitch, can also be computed and represented by arrays of channels analogous to the pitch estimates.
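A toy harmonic template matcher conveys the flavor of such a pitch estimate from resolved harmonics: each candidate fundamental is scored by the spectral energy at its first few harmonics. The frequency grid, candidate list, and scoring are illustrative simplifications, not the model's actual algorithm.

```python
import numpy as np

def template_pitch(spec, freqs, candidates, n_harm=8):
    """Score each candidate fundamental by summing spectral magnitude at
    its first n_harm harmonics; return the best-matching candidate.

    spec       : (F,) magnitude spectrum of one frame
    freqs      : (F,) centre frequency of each bin, Hz
    candidates : iterable of candidate fundamentals, Hz
    """
    best, best_score = None, -np.inf
    for f0 in candidates:
        # nearest spectral bin to each harmonic of this candidate
        idx = [np.argmin(np.abs(freqs - k * f0)) for k in range(1, n_harm + 1)]
        score = spec[idx].sum()
        if score > best_score:
            best, best_score = f0, score
    return best
```

A spectrum containing only the harmonics of 200 Hz scores highest for the 200 Hz template, since sub- and super-multiples hit only a subset of the peaks.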
The auditory spectrogram is generated by a model of early auditory processing, which begins with an affine wavelet transform of the acoustic signal, followed by nonlinear rectification and compression, and lateral inhibition to sharpen features. Cortical spectrotemporal analysis of the spectrogram is effectively performed in two steps: a spectral wavelet decomposition followed by a temporal wavelet decomposition, as depicted in Fig.
The first analysis provides multi-scale, multi-bandwidth views of each spectral slice, resulting in a 2D frequency-scale representation. It is implemented by convolving the spectral slice with complex-valued spectral receptive fields similar to Gabor functions, parametrized by their spectral tuning. The outcome of this step is an array of F x S frequency-scale channels, indexed by frequency and local spectral bandwidth, at each time instant t.
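A simplified sketch of this frequency-scale decomposition follows, assuming Gaussian-windowed complex exponentials as the Gabor-like receptive fields; the function name, scale values, and bandwidth rule are illustrative choices rather than the model's exact parametrization.

```python
import numpy as np

def frequency_scale_channels(spec_slice, scales=(1, 2, 4, 8)):
    """Multi-scale spectral analysis of one spectrogram frame: convolve the
    slice along frequency with complex Gabor-like kernels of different
    bandwidths, yielding an F x S frequency-scale representation.

    spec_slice : (F,) spectral magnitudes of one time frame
    scales     : spectral tuning of each kernel (illustrative units)
    """
    F = len(spec_slice)
    out = np.empty((F, len(scales)), dtype=complex)
    x = np.arange(-F // 2, F // 2)
    for j, s in enumerate(scales):
        sigma = F / (2.0 * s)                   # broader window = coarser scale
        kernel = np.exp(-(x / sigma) ** 2) * np.exp(2j * np.pi * s * x / F)
        # convolution along the frequency axis at this scale
        out[:, j] = np.convolve(spec_slice, kernel, mode='same')
    return out
```

Coarse scales respond to broad spectral envelopes (formant-like structure), while fine scales respond to closely spaced ripples such as resolved harmonics.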
In addition, the pitch of each spectrogram frame is also computed, if desired, using a harmonic template-matching algorithm.