Source Separation via Spectral Masking for Speech Recognition Systems

In this paper we present an insight into the use of spectral masking techniques in the time-frequency domain as a preprocessing step for speech recognition. The performance of speech recognition systems degrades in noisy environments or in the presence of other speech signals. The limits of these masking techniques for different levels of the signal-to-noise ratio are discussed. We show the robustness of the spectral masking techniques against four types of noise: white, pink, brown and human speech noise (bubble noise). The main contribution of this work is to analyze the performance limits of recognition systems using spectral masking. We obtain an increase of 18% in the speech hit rate when the speech signals are corrupted by other speech signals or bubble noise, with signal-to-noise ratios of approximately 1, 10 and 20 dB. On the other hand, applying the ideal binary masks to mixtures corrupted by white, pink and brown noise results in an average growth of 9% in the speech hit rate for the same signal-to-noise ratios. The experimental results suggest that the spectral masking techniques are more appropriate when applied to bubble noise, which is produced by human speech, than to white, pink and brown noise.
Keywords: Blind source separation, Independent component analysis, Neural networks, Spectral masking, Speech recognition.


I. INTRODUCTION
When several people are talking at the same time in a meeting or in public places, it is necessary to separate the voice of a given person or a specific source from other interference sources so that each speaker can be recognized. Independent component analysis has been an important source separation technique. However, in the presence of noise and reverberation, the separated signals have strong residual components of other interference sources [1]. In these cases, a signal preprocessing method must be used in order to reduce other sources of interference. Our goal is to show the potential improvement of automatic speech recognition in noisy environments or with multiple speech signals. In this paper, we show that spectral masking techniques, used as preprocessing tools, reduce other sources of interference and increase the efficiency of speech recognition systems.
Several works use observation vectors of uncertainties in the decoding process for the treatment of noisy signals in the automatic speech recognition task [2]-[4]. When dealing with speech recognition in environments with several speakers, such as the ones in [5]-[7], some authors suggest the use of binary masking [8], [9]. However, when speech signals are exposed to environments with reverberation and in the presence of other speakers, the process of extracting the masks becomes extremely difficult. This paper intends to quantify and analyze the efficiency of spectral masking in speech recognition tasks.
G. F. Rodrigues, T. S. Siqueira and A. C. Souza are with the Department of Telecommunications and Mechatronic Engineering, Federal University of São João del-Rey, Ouro Branco, Minas Gerais, Brazil (corresponding author; phone and fax: +55 31 37413583; e-mail: gustavofernandes@gmail.com).
H. C. Yehia is with the Department of Electronic Engineering, Federal University of Minas Gerais, Belo Horizonte, Minas Gerais, Brazil.
The majority of speech recognition systems do not perform well in noisy environments or when there is interference from other voices. Therefore, we intend to improve the efficiency of speech recognition systems through the implementation of a source separation method using spectral masking in the time-frequency domain as a preprocessing stage. Time-frequency domain masking is used to extract a specific speech signal from the noise-corrupted signal [10], [11]. The mel-cepstral parameters are used in the speech recognition step to provide the input data to the speech recognition system. Some papers show that binary masking makes it possible to extract the time-frequency information that best characterizes the speech signal [11]. The need to improve the performance of speech recognition systems under adverse conditions and with multiple speakers has attracted the attention of researchers, and many papers about the separation of speech signals have been published [1], [4]-[6], [9], [10]-[14].
This paper is organized as follows. In Section II we discuss source separation techniques via spectral masking. In Section III we describe the steps to obtain the speech signal parameters and discuss the implementation of the speech recognition system. In Section IV we show the results of the experiments and simulations done to verify the influence of noise and other voices in speech recognition tasks. The tests made to analyze the limits and improvement capacity of the speech recognition systems through spectral masking, as well as the analysis of the results obtained, are also detailed in Section IV. Finally, Section V outlines the conclusions of this work.

II. SPECTRAL MASKING IN TIME-FREQUENCY DOMAIN
A specific sound source can be recovered by applying a weighted mask to an acoustic mixture at each point in the time-frequency domain. The regions dominated by this source receive higher weights than the ones where other sound sources of the analyzed mixture prevail. The masks may be binary or assume real values. The use of a binary mask is motivated by the masking process which occurs in human audition, where a more intense sound can mask or obscure a less intense one within the same critical band. With the purpose of separating voice signals, the use of an ideal binary mask was proposed in [15]. Given a speech signal s(t, f) and a noise n(t, f), where t and f represent the instant in time and the frequency, respectively, the ideal binary mask m(t, f) can be obtained through the following expression:

$$m(t, f) = \begin{cases} 1, & \text{if } |s(t, f)| > |n(t, f)| \\ 0, & \text{otherwise.} \end{cases}$$

A similar approach was adopted by the authors of [16], who observed an orthogonal tendency in different voice signals in the time-frequency domain with high resolution, showing that it is possible to separate signals through binary masking. Several papers have shown that the speech signal reconstructed by an ideal binary mask is intelligible when extracted from a mixture of two or more speakers [9], [17], [18].
The ideal binary mask considered here is a binary matrix which assumes value one when the signal energy is stronger than the interference signal for a specific frequency in a given instant and assumes value zero otherwise. When we apply an ideal mask to an instantaneous mixture of two speech signals, we notice that the signal obtained is perfectly audible, with good quality and no interference from the other signal.
In order to obtain the ideal mask, the source signals ($s_1$ and $s_2$) are transformed to the time-frequency domain and their spectrograms are given, respectively, by:

$$S_1(w, t) = \mathrm{STFT}\{s_1\}, \qquad S_2(w, t) = \mathrm{STFT}\{s_2\},$$

where $w$ represents angular frequency and $t$ represents the time instant of the voice frame being analyzed. The binary masking can be determined by comparing the magnitudes of the two spectrograms, as shown in Fig. 1. The ideal masks ($M_1$ and $M_2$) were obtained as follows:

$$M_1(w, t) = 1 \ \text{if} \ |S_1(w, t)| > |S_2(w, t)|, \qquad M_2(w, t) = 1 \ \text{if} \ |S_2(w, t)| > |S_1(w, t)|,$$

and the other values of the masks are equal to zero. Fig. 2 shows an example of ideal binary masks applied to two speech signals, sampled at 8000 Hz and divided into frames of 512 samples with 50% overlap.
Spectral masking techniques can be applied to speech signal separation, especially when the mixture is corrupted by noise signals. When applying an ideal binary mask to separate two speech signals, it is possible to improve the signal-to-interference ratio (SIR) by over 40 dB, but it also increases the signal-distortion ratio (SDR) by 20 dB [1]. However, an ideal binary mask cannot be obtained without knowledge of the real signals, and an approximation is needed. Several methods to estimate masks based on independent component analysis (ICA) can be found in the literature, both for binary and continuous masks [12], [14].
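The mask construction described in this section can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions: the function names `stft` and `ideal_binary_masks` are hypothetical, the Hamming analysis window is an assumption (the text specifies it only for the feature extraction step), and the two sources are synthetic stand-ins for real recordings.

```python
import numpy as np

def stft(x, frame_len=512, hop=256):
    """Short-time Fourier transform: windowed frames with 50% overlap."""
    window = np.hamming(frame_len)  # assumed analysis window
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)  # shape: (n_frames, frame_len // 2 + 1)

def ideal_binary_masks(s1, s2, frame_len=512, hop=256):
    """Compare spectrogram magnitudes; a mask is 1 where its source dominates."""
    S1 = stft(s1, frame_len, hop)
    S2 = stft(s2, frame_len, hop)
    M1 = (np.abs(S1) > np.abs(S2)).astype(np.uint8)
    M2 = (np.abs(S2) > np.abs(S1)).astype(np.uint8)
    return M1, M2

# Synthetic stand-ins for two known sources sampled at 8000 Hz.
fs = 8000
rng = np.random.default_rng(0)
t = np.arange(fs) / fs
s1 = np.sin(2 * np.pi * 440 * t)       # hypothetical target source
s2 = 0.5 * rng.standard_normal(fs)     # hypothetical interference

M1, M2 = ideal_binary_masks(s1, s2)
mixture_spec = stft(s1 + s2)
masked_spec = M1 * mixture_spec        # keep only the cells dominated by s1
```

Because the masks require both sources to be known, this kind of computation is only possible in simulations such as the ones reported here; in practice the mask has to be estimated.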
The signal-distortion ratio decreases about 3 dB when approximately 10% of the bits of the ideal binary mask are inverted, as shown in [19]. An error of 10% of the bits in the binary mask estimate is acceptable without loss of intelligibility [19], [20].
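To make the notion of inverted mask bits concrete, the following sketch flips a chosen fraction of the entries of a binary mask at random. The helper name `flip_mask_bits` and the random mask used as input are hypothetical; the sketch only illustrates how a 10% bit error could be simulated and does not reproduce the experiments of [19].

```python
import numpy as np

def flip_mask_bits(mask, fraction, rng):
    """Invert a given fraction of the entries of a binary mask, chosen at random."""
    flat = mask.ravel().copy()
    n_flip = int(round(fraction * flat.size))
    idx = rng.choice(flat.size, size=n_flip, replace=False)
    flat[idx] = 1 - flat[idx]
    return flat.reshape(mask.shape)

rng = np.random.default_rng(0)
ideal = rng.integers(0, 2, size=(100, 257))   # stand-in for an ideal binary mask
noisy = flip_mask_bits(ideal, 0.10, rng)
error_rate = np.mean(noisy != ideal)          # fraction of inverted bits
```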
The objective of this paper is to show the potential improvement of speech recognition systems, in noisy environments or with multiple speech signals, using spectral masking techniques. We present a signal preprocessing method to reduce other sources of interference using ideal binary masks. Since the speech and noise signals were known in this experiment, the ideal binary mask could be obtained for each mixture analyzed. We use all the recordings to simulate mixtures corrupted by noise, and therefore to obtain the ideal masks. The ideal binary masks obtained were applied to a mixture corrupted by another speech signal or by noise, sampled at 8000 Hz and divided into frames of 512 samples with 50% overlap. The ideal binary masks were used to separate the signal of interest (speech signal) from the noise, as a preprocessing step for speech recognition.
The ideal masks obtained were directly applied to the speech signal mixtures in the time-frequency domain, before the extraction of the mel-cepstral coefficients and the learning process of the neural network used in the speech recognition tests. The ideal masks were not applied to the training set used to train the neural network.

III. SPEECH RECOGNITION SYSTEM
Speech recognition systems perform poorly in noisy environments, such as industrial settings. In this work, we are concerned with real situations where speech signals are corrupted by noise, including other speech signals. A speaker-dependent speech recognition system was used to recognize isolated voice commands from a limited vocabulary (30 Portuguese words), and each word was recorded 20 times by one speaker. The corpus consists of voice commands that can be used in automation and control systems. The Portuguese words used (given here in English) were: right, left, stop, go back, go on, ahead, behind, on, off, fast, slow, turn on, turn off, up, down, speed up, lock, unlock and alarm. Besides these words, the database contains eleven digits: 0 to 10. The ideal binary masks were used to separate the target speech signals from a mixture with another speech signal or other types of noise. We use all the recordings to simulate mixtures corrupted by noise, and therefore to obtain the ideal masks. The ideal masks obtained were directly applied to the speech signal mixtures in the time-frequency domain, before the extraction of the mel-cepstral coefficients and the learning process of the neural network used in the speech recognition system.

A. Extraction of Mel-Cepstral coefficients
The mel-cepstral parameters are the most widely used input features in speech recognition systems. The mel is a unit of measurement for the perceived pitch of a tone. It does not have a linear correspondence to the physical frequency. The mel scale is defined by a mapping between the frequency scale (in Hz) and the perceived frequency scale (in mel). The mapping is linear up to approximately 1 kHz and logarithmic for higher frequencies. The mel frequency scale is closely related to the critical bands of the human auditory system, and the mel-cepstral coefficients are obtained from the mel frequency.
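The text does not state which analytic form of the mel mapping was used. One widely used approximation, given here only as an assumption for illustration, is:

```latex
\mathrm{mel}(f) = 2595 \, \log_{10}\!\left(1 + \frac{f}{700}\right),
\qquad
f = 700 \left(10^{\mathrm{mel}/2595} - 1\right),
```

with $f$ in Hz. Under this formula, 1000 Hz maps to approximately 1000 mel, and the curve is close to linear below 1 kHz and approximately logarithmic above it, in agreement with the description above.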
The extraction of the mel-cepstral coefficients consists of the following steps: (i) obtain the magnitude spectrum of the signal by applying the Fourier transform; (ii) calculate each output of a filter bank in the mel scale as a weighted sum of the spectral magnitudes of each frame; (iii) take the logarithm of each filter output; (iv) take the discrete cosine transform (DCT) of the log filter outputs of each frame.
The speech signals used were recorded with a sampling frequency of 8000 Hz. In the extraction step, the data were divided into frames of 512 samples with 50% overlap. For each frame, a Hamming window was applied and 13 mel-cepstral coefficients were extracted.
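The four extraction steps can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: the number of triangular filters (26), the mel formula, the unnormalized DCT-II, and all function names are assumptions introduced here; only the sampling rate, frame length, overlap, Hamming window and the 13 coefficients come from the text.

```python
import numpy as np

def hz_to_mel(f):
    # Assumed mel mapping (a common approximation, not stated in the text).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters with centers equally spaced on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):          # rising slope
            fbank[i - 1, k] = (k - left) / (center - left)
        for k in range(center, right):         # falling slope
            fbank[i - 1, k] = (right - k) / (right - center)
    return fbank

def dct_ii(x, n_out):
    """Unnormalized DCT-II along the last axis, keeping the first n_out terms."""
    n = x.shape[-1]
    k = np.arange(n_out)[:, None]
    basis = np.cos(np.pi * k * (2 * np.arange(n) + 1) / (2 * n))
    return x @ basis.T

def mfcc(signal, fs=8000, frame_len=512, hop=256, n_filters=26, n_coeffs=13):
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    magnitude = np.abs(np.fft.rfft(frames, axis=1))                     # step (i)
    energies = magnitude @ mel_filterbank(n_filters, frame_len, fs).T   # step (ii)
    log_energies = np.log(energies + 1e-10)                             # step (iii)
    return dct_ii(log_energies, n_coeffs)                               # step (iv)

rng = np.random.default_rng(0)
coeffs = mfcc(rng.standard_normal(8000))   # one second of synthetic audio
```

Each row of `coeffs` holds the 13 coefficients of one frame, so a one-second signal at 8000 Hz yields a matrix with one row per 512-sample frame.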

B. Principal Components Analysis
Principal Components Analysis (PCA) consists of a linear transformation of "m" original variables into "m" new variables, in such a way that the first new variable accounts for as much of the variability in the data as possible and each succeeding component in turn has the highest variance possible under the constraint that it is uncorrelated with the preceding ones, until all the variance in the set has been explained. The purpose of this technique in this case is to reduce the dimension of the data while discarding as little of its variance as possible.
For each frame we have used 13 mel-cepstral coefficients, and therefore the data set was represented by N vectors of 13 coefficients, where N corresponds to the number of frames under analysis:

$$x_n = [c_{n,1}, c_{n,2}, \ldots, c_{n,13}]^T,$$

where $[\cdot]^T$ denotes the transpose. The variable $x_n$, $1 \le n \le N$, represents one frame of the audio signal being analyzed. The set of all frames is represented by the matrix formed by the N vectors:

$$X = [x_1, x_2, \ldots, x_N],$$

where each column of $X$ denotes the 13 coefficients of each signal frame. From that, the covariance matrix is defined by:

$$C = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu)(x_n - \mu)^T,$$

where $\mu$ is the mean vector. Then, through the singular value decomposition (SVD), one can write the covariance matrix as:

$$C = U S U^T,$$

where $U$ is a matrix whose columns are the eigenvectors of $C$ and $S$ is a diagonal matrix containing the respective eigenvalues of $C$.
The sum of the eigenvalues represents the total variance observed in C. Therefore, if the sum of the first k eigenvalues reaches a given proportion of the sum of all eigenvalues (85% in this work), then the first k eigenvectors of C account for most of the total variance observed in the data set. In this paper we used the first three components as the input vector for the pattern classification by the neural network. The dimension of the speech features was reduced to a vector of 39 values for each word.
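The selection of components by explained variance can be sketched as below. The function name `pca_svd`, the 1/N normalization of the covariance and the synthetic data are assumptions made for illustration; the sketch does not reproduce how the 39-value word vector was assembled.

```python
import numpy as np

def pca_svd(X, variance_kept=0.85):
    """PCA of a (13, N) matrix whose columns are frames.

    Returns the first k eigenvectors, all eigenvalues, and k, where k is the
    smallest number of components whose eigenvalues sum to at least
    `variance_kept` of the total variance.
    """
    mu = X.mean(axis=1, keepdims=True)
    Xc = X - mu
    C = Xc @ Xc.T / X.shape[1]            # covariance matrix (1/N assumed)
    U, S, _ = np.linalg.svd(C)            # C is symmetric: S holds its eigenvalues
    ratio = np.cumsum(S) / np.sum(S)
    k = int(np.searchsorted(ratio, variance_kept) + 1)
    return U[:, :k], S, k

# Synthetic data with three dominant directions in a 13-dimensional space.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((13, 13)))
scales = np.array([10.0, 10.0, 10.0] + [0.1] * 10)
X = Q @ (scales[:, None] * rng.standard_normal((13, 500)))

U_k, eigenvalues, k = pca_svd(X)
Y = U_k.T @ (X - X.mean(axis=1, keepdims=True))   # frames projected onto k components
```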

C. Learning with neural network
A corpus of 600 utterances was used, of which half was used for training and the remaining half for testing. The input speech was reduced to a vector of 39 values for each word from the mel-cepstral coefficients (the dimension was reduced by PCA).
Algorithms based on multilayer perceptron neural networks, using the backpropagation algorithm for the supervised learning process, have been used in voice recognition systems [21]-[24]. The network used in this paper is a feedforward multilayer perceptron (MLP) trained with the stochastic backpropagation algorithm. This experiment uses a three-layer MLP: one input layer, one hidden layer and one output layer. The feature vectors representing the speech patterns are fed into the neural network at the input layer. Only 39 values (from the mel-cepstral coefficients with the dimension reduced by PCA) for each word are fed to the neural network. The single hidden layer has 30 neurons. The number of neurons in the output layer is set to five. The output of the network is a binary value representing the recognized word. The hidden and output neurons are activated using sigmoidal and linear activation functions, respectively. Once the network is created, it can be trained for a specific problem by presenting training inputs and their corresponding targets (supervised training). A set of 10 samples of each word (300 utterances) was used as training data and the other part as test data. The binary masks were not used in the training set. The ideal masks were applied only to the test set used to test the neural network.
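A network with the topology described above (39 inputs, 30 sigmoidal hidden neurons, 5 linear outputs) can be sketched as follows. Several details are assumptions, since the text does not specify them: the squared-error loss, the learning rate, the weight initialization, and in particular the encoding of the word index as a 5-bit binary code (`word_to_bits` and `bits_to_word` are hypothetical helpers). The training data are random vectors, not speech features.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class MLP:
    """Three-layer perceptron: sigmoidal hidden layer, linear output layer."""

    def __init__(self, n_in=39, n_hidden=30, n_out=5, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (n_hidden, n_in))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.1, (n_out, n_hidden))
        self.b2 = np.zeros(n_out)

    def forward(self, x):
        h = sigmoid(self.W1 @ x + self.b1)
        y = self.W2 @ h + self.b2
        return h, y

    def sgd_step(self, x, target, lr=0.02):
        """One stochastic backpropagation update for L = 0.5 * ||y - target||^2."""
        h, y = self.forward(x)
        err = y - target
        grad_h = (self.W2.T @ err) * h * (1.0 - h)   # computed before W2 changes
        self.W2 -= lr * np.outer(err, h)
        self.b2 -= lr * err
        self.W1 -= lr * np.outer(grad_h, x)
        self.b1 -= lr * grad_h
        return 0.5 * float(err @ err)

def word_to_bits(index, n_bits=5):
    """Assumed target encoding: word index written as a 5-bit binary code."""
    return np.array([(index >> b) & 1 for b in range(n_bits)], dtype=float)

def bits_to_word(y):
    """Threshold the linear outputs at 0.5 and decode the binary code."""
    return int(sum(int(v > 0.5) << b for b, v in enumerate(y)))

rng = np.random.default_rng(0)
inputs = rng.standard_normal((30, 39))             # stand-ins for 30 word vectors
targets = np.array([word_to_bits(i) for i in range(30)])
net = MLP()
epoch_losses = []
for epoch in range(50):
    order = rng.permutation(30)                    # one sample at a time
    epoch_losses.append(np.mean([net.sgd_step(inputs[i], targets[i]) for i in order]))
```

Five binary outputs can distinguish up to 32 classes, which is enough for the 30-word vocabulary; this is one plausible reading of the five-neuron output layer, not a confirmed detail.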

IV. RESULTS
The recordings were made in a laboratory with low noise level, with approximately zero reverberation time and a sampling frequency of 8000 Hz. The hit rate (word recognition rate) was used as the measure of speech recognition efficiency. Another common metric is the word error rate (WER); we use the hit rate, which is the complement of the word error rate. In order to obtain the speech hit rates, 100 simulations were made for each case being analyzed. Different simulations were done for testing, where the original speech signal was corrupted by other speech signals from different speakers and by different types of noise. In each case, different levels of signal-to-noise ratio (1, 10, 20 and 30 dB) were considered.
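A mixture at a prescribed signal-to-noise ratio can be simulated by scaling the interference before adding it to the speech. The sketch below shows one way to do this; the function name `mix_at_snr` and the synthetic signals are hypothetical, and the interference is assumed to be at least as long as the speech signal.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that the speech-to-noise power ratio equals snr_db."""
    noise = noise[:len(speech)]          # assumes len(noise) >= len(speech)
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise, gain

rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 200 * np.arange(8000) / 8000)   # stand-in for speech
noise = rng.standard_normal(8000)                           # stand-in for interference

measured = {}
for snr_db in (1, 10, 20, 30):
    mixture, gain = mix_at_snr(speech, noise, snr_db)
    residual = mixture - speech
    measured[snr_db] = 10 * np.log10(np.mean(speech ** 2) / np.mean(residual ** 2))
```

After the loop, `measured` maps each target level to the ratio actually obtained in the mixture, which is a quick consistency check before running recognition tests.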
Noise is modeled as a random signal, but it can have different statistical properties. In this paper, we analyzed the limits of spectral masking techniques for the following types of noise: white, pink, brown and human speech noise (bubble noise). By definition, white noise has a flat frequency spectrum. Pink noise, or "1/f noise", is a signal with power spectral density inversely proportional to the frequency: compared with white noise, the power density decreases by 3 dB per octave. Brown noise has a power density which decreases by 6 dB per octave with increasing frequency (the density is proportional to 1/f^2). In order to create a noise with a frequency spectrum similar to human speech, we concatenated all the words of the vocabulary used in this work, spoken by 3 males and 3 females [25]. The frequency spectra of the different types of noise used are shown in Fig. 3.
The following cases were analyzed: i) speech signal corrupted by a speech signal from another speaker; ii) speech signal corrupted by two speech signals from other speakers; iii) speech signal corrupted by three speech signals from other speakers; iv) speech signal corrupted by different types of noise.
According to the results shown in Fig. 4, when the speech signal is corrupted by low-level noise composed of other speech signals, with a signal-to-noise ratio of approximately 30 dB, the hit rate is the same as when the ideal binary mask is applied. In some cases there was a small performance improvement when using the ideal binary mask, as shown in Tables I and II.
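The white, pink and brown noises defined in this section can be generated by shaping the spectrum of white Gaussian noise. The sketch below is one possible approach, not necessarily the one used in the experiments; the function name `colored_noise` and the unit-variance normalization are assumptions.

```python
import numpy as np

def colored_noise(n_samples, exponent, rng):
    """Noise with power spectral density proportional to 1 / f**exponent.

    exponent = 0 gives white noise, 1 gives pink noise, 2 gives brown noise.
    """
    white = rng.standard_normal(n_samples)
    spectrum = np.fft.rfft(white)
    freqs = np.fft.rfftfreq(n_samples)
    freqs[0] = freqs[1]                        # avoid division by zero at DC
    spectrum *= freqs ** (-exponent / 2.0)     # amplitude scales as f^(-exponent/2)
    noise = np.fft.irfft(spectrum, n=n_samples)
    return noise / np.std(noise)               # normalize to unit variance

rng = np.random.default_rng(1)
white = colored_noise(2 ** 16, 0, rng)
pink = colored_noise(2 ** 16, 1, rng)
brown = colored_noise(2 ** 16, 2, rng)
```

Since one octave doubles the frequency, a density proportional to 1/f drops by 10 log10(2), about 3 dB, per octave, and a density proportional to 1/f^2 drops by about 6 dB per octave, matching the definitions above.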
We also verify that, in the cases where the original signals were corrupted by other speech signals with signal-to-noise ratios of 1, 10 and 20 dB, there is an average growth of 18 percentage points in the hit rate when applying ideal binary masking. These results show that, when various signals are mixed, the spectral masking technique provides a gain of approximately 10 dB in noise level attenuation, significantly improving the performance of speech recognition systems, as shown in Fig. 4.
We show the robustness of the spectral masking techniques against four types of noise as well: white, pink, brown and bubble. In cases where the signal is corrupted by different types of noise, with SNR of approximately 30 dB (low level noise), we notice a small performance improvement of the hit rate when the ideal binary mask was applied, as shown in Table IV.
Moreover, when applying ideal binary masking with higher levels of human speech noise (signal-to-noise ratios of 1, 10 and 20 dB), we observe an average growth of 18% in the hit rate, similar to that observed in Tables I, II and III for the same levels of signal-to-noise ratio. On the other hand, applying the ideal binary masks to mixtures corrupted by white, pink and brown noise results in an average growth of 9% in the speech hit rate for the same signal-to-noise ratios.
Among the different types of noise, the experimental results reveal that the best hit rates, when applying ideal binary masking, were obtained with bubble noise. The worst results were obtained when applying the ideal mask to white noise, as shown in Fig. 5. This suggests that the spectral masking techniques work best for bubble noise, which is produced by human speech, and justifies their application to realistic situations such as human communication.

V. CONCLUSIONS
In this paper we presented an insight into the use of spectral masking techniques in the time-frequency domain as a preprocessing step for speech recognition. The performance of speech recognition systems degrades in noisy environments or in the presence of other voice signals. A speech recognition system based on source separation using ideal binary masks was presented in order to investigate the performance improvement in speech recognition. The signals were corrupted by noise (speech signals and different types of noise) with different signal-to-noise ratios (1, 10, 20 and 30 dB) during the tests. We showed the robustness of the spectral masking techniques in the presence of four types of noise: white, pink, brown and bubble. The main contribution of this study was the analysis of the limits of performance improvement of recognition systems using ideal spectral masking. We verified an average improvement of 18% in the hit rates for signal-to-noise ratios of 1, 10 and 20 dB. We also showed that the spectral masking techniques, when applied to mixtures corrupted by other speech signals, provide an average gain of 10 dB in noise level attenuation for the same signal-to-noise ratios. The experimental results suggest that the spectral masking techniques are more appropriate when applied to bubble noise, which is produced by human speech, than to white, pink and brown noise.

Fig. 2. Images of the ideal binary masks from the sources (s1 and s2) for 100 speech frames (axes: frames and frequency).

Fig. 4. Speech hit rate for the cases where the original speech signals were corrupted by other speech signals from different speakers with different signal-to-noise ratios (1, 10 and 20 dB). The limits of performance improvement of the recognition system using ideal spectral masking (with masking) were analyzed.

Fig. 5. Speech hit rate for the cases where the original speech signals were corrupted by different types of noise (white, pink, brown and bubble) with different signal-to-noise ratios (1, 10 and 20 dB). The limits of performance improvement of the recognition system using ideal spectral masking (with masking) were analyzed.

TABLE I
THE SPEECH SIGNALS WERE CORRUPTED BY ONE SPEECH SIGNAL FROM ANOTHER SPEAKER WITH DIFFERENT LEVELS OF SIGNAL-TO-NOISE RATIO (1, 10, 20 AND 30 dB). THE TESTS WERE PERFORMED USING AN IDEAL BINARY MASK AND WITHOUT APPLICATION OF THE IDEAL MASK.

TABLE II
THE SPEECH SIGNALS WERE CORRUPTED BY TWO SPEECH SIGNALS FROM OTHER SPEAKERS WITH DIFFERENT LEVELS OF SIGNAL-TO-NOISE RATIO (1, 10, 20 AND 30 dB). THE TESTS WERE PERFORMED USING AN IDEAL BINARY MASK AND WITHOUT APPLICATION OF THE IDEAL MASK.

TABLE III
THE SPEECH SIGNALS WERE CORRUPTED BY THREE SPEECH SIGNALS FROM OTHER SPEAKERS WITH DIFFERENT LEVELS OF SIGNAL-TO-NOISE RATIO (1, 10, 20 AND 30 dB). THE TESTS WERE PERFORMED USING AN IDEAL BINARY MASK AND WITHOUT APPLICATION OF THE IDEAL MASK.

TABLE IV
THE SPEECH SIGNALS WERE CORRUPTED BY DIFFERENT TYPES OF NOISE WITH DIFFERENT LEVELS OF SIGNAL-TO-NOISE RATIO (1, 10, 20 AND 30 dB). THE TESTS WERE PERFORMED USING AN IDEAL BINARY MASK AND WITHOUT APPLICATION OF THE IDEAL MASK.