Optimizing dictionary learning parameters for solving Audio Inpainting problem

Abstract: Recovering missing or distorted audio signal samples has recently been improved by solving the Audio Inpainting problem. This paper connects this problem with K-SVD dictionary learning in order to reduce the reconstruction error when filling in missing signal samples. Our aim is to adapt an initial dictionary to the reliable part of the signal so that the missing samples are estimated more accurately. The approach is based on sparse signal reconstruction formulated as an optimization problem. The paper describes the two underlying algorithms, the connection between them and the problems that emerge. We tried to find optimal parameters for efficient dictionary learning.


I. INTRODUCTION
Ever since audio recording and transmission were invented, there has been a possibility of errors arising in the signal. Analogue sound carriers were sensitive to damage such as scratches on a gramophone record or a torn magnetic tape. Digital audio carriers are more resistant to such damage, mainly because of error correction codes, but transmitting an audio signal through, e.g., IP telephony can still be affected by packet loss.
Many problems connected with the restoration of analogue audio recordings and their digitized copies have been solved in the past by various techniques. For example, distorted or missing signal segments have been recovered by interpolation techniques [1], sample repetition, the wavelet transform [2] or neural networks [3]. IP telephony is a common example of the packet loss problem, where data lost during transmission have to be recovered, e.g. by [4].
Recent research in the field of sparse signal representations has brought a new approach that can be utilized for audio restoration. Methods for solving the problem called Audio Inpainting were first presented in [5] as an algorithm for recovering missing samples in an audio signal. This framework provides segment-wise restoration of audio signals where the positions of the missing samples are known a priori. The sparse representation modeling uses the Orthogonal Matching Pursuit method for solving the inverse problem. Gabor or DCT dictionaries are used as the initial dictionary for sparse signal modeling.
There are two ways to obtain a better-suited dictionary: choosing a different function to construct the dictionary, or adapting an initial dictionary to the given signal. We decided to go the second way and adapted an initial dictionary with the K-SVD algorithm [6]. This method is a generalization of the K-means clustering process and helps the dictionary to better fit the observed audio data.
Section II describes sparse signal modeling together with Audio Inpainting as one of its applications. Dictionary learning via the K-SVD algorithm is described in Section III. Section IV shows our experimental results. Software components are described in Section V, and the optimal parameters for dictionary learning that we found are summed up in the conclusion (Section VI).

II. AUDIO INPAINTING BY SPARSITY CONSTRAINTS
Sparse signal representations for inpainting problems were first used in image signal processing [7] and a few years later the Audio Inpainting algorithm was introduced in [5].
For a given signal $y \in \mathbb{R}^N$ it is known which of its samples are reliable and which are distorted or missing. Therefore, we can divide the support of the vector into two sets $I_r$ and $I_m$ containing the coordinates of the reliable and missing samples, respectively. As a consequence of the partitioning we have $I_r \cup I_m = \{1, \dots, N\}$ and $I_r \cap I_m = \emptyset$. By deleting the rows with indices in $I_m$ from the $N \times N$ identity matrix we obtain the matrix $M_r \in \mathbb{R}^{|I_r| \times N}$, which selects the reliable samples from the signal:

$$y_r = M_r y. \qquad (1)$$

Using methods based on sparsity, we have to find an appropriate function system to represent this class of signals by only a few prototype functions. The design of such a collection of atoms, into which we can efficiently expand our signal, will be the central concern of this contribution. Given a set of atoms $\{d_j\}_{j=1}^{M}$ one can build the dictionary matrix $D \in \mathbb{R}^{N \times M}$ whose columns are the atoms. For a given coefficient vector $c$ we can then compute the corresponding signal as a linear combination of the dictionary atoms by the matrix multiplication

$$y = Dc. \qquad (2)$$

From now on, we will assume that $D$ has full rank and that $M \ge N$, with the consequence that for a given signal $y$ we can always find a set of coefficients such that (2) is satisfied. Such an expansion can always be computed by the pseudo-inverse of $D$, denoted by $D^{+}$, i.e. $c = D^{+} y$ satisfies (2). An expansion $c$ is called sparse if only few entries of the vector are different from zero. We denote the size of the support of the coefficient vector by $\|c\|_0$. An expansion $c$ which has only few entries significantly different from zero is called compressible. It is quite clear that for a given dictionary only a certain class of signals will admit sparse expansions. In the later sections we will be concerned with audio signals, and for this specific class a number of different dictionaries have been proposed, among them Gabor and DCT systems [5].
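To make the notation concrete, the following minimal Python/NumPy sketch builds $M_r$ for a toy signal, selects the reliable samples as in (1) and computes one expansion through the pseudo-inverse as in (2). It is only an illustration and not the MATLAB code used for the experiments; the helper name `selection_matrix`, the toy signal and the random dictionary are assumptions made here.

```python
import numpy as np

def selection_matrix(n, missing):
    """Return M_r: the n x n identity with the rows listed in `missing` deleted."""
    missing_set = set(missing)
    reliable = [i for i in range(n) if i not in missing_set]
    return np.eye(n)[reliable, :]

# Toy signal of length N = 6 with samples 2 and 3 missing (0-based indices).
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
M_r = selection_matrix(len(y), missing=[2, 3])
y_r = M_r @ y                      # equation (1): keep the reliable samples only

# Hypothetical random dictionary with M = 12 >= N = 6 atoms (columns).
rng = np.random.default_rng(0)
D = rng.standard_normal((6, 12))
c = np.linalg.pinv(D) @ y          # c = D^+ y, one (generally non-sparse) expansion
y_back = D @ c                     # equation (2): y = D c
support_size = np.count_nonzero(c) # ||c||_0, the size of the support of c
```

The pseudo-inverse expansion reproduces the signal but is generally dense, which is why a pursuit algorithm is needed to obtain a sparse coefficient vector.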
As we need these dictionaries later for performance comparison, we define them now. For $0 \le j < N$, the corresponding DCT atom is defined at the points $0 \le m < N$. The resulting transform is purely real and will admit a certain level of sparsity for audio signals due to its harmonic structure.
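As a hedged illustration of such a dictionary, the sketch below builds an orthonormal DCT-II system in Python/NumPy. The exact atom definition and normalization used in this paper may differ; the textbook DCT-II form $d_j(m) = a_j \cos\left(\frac{\pi}{N}\left(m + \frac{1}{2}\right) j\right)$ and the function name `dct_dictionary` are assumptions.

```python
import numpy as np

def dct_dictionary(n):
    """Return an n x n matrix whose columns are orthonormal DCT-II atoms.

    Assumed (textbook) form: d_j(m) = a_j * cos(pi / n * (m + 1/2) * j),
    with a_0 = sqrt(1/n) and a_j = sqrt(2/n) for j >= 1.
    """
    m = np.arange(n).reshape(-1, 1)   # sample index, one row per sample
    j = np.arange(n).reshape(1, -1)   # atom index, one column per atom
    D = np.cos(np.pi / n * (m + 0.5) * j)
    D[:, 0] *= np.sqrt(1.0 / n)       # constant atom
    D[:, 1:] *= np.sqrt(2.0 / n)      # oscillating atoms
    return D
```

With this scaling every atom has unit norm, so coefficients obtained for different atoms are directly comparable.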
Gabor dictionaries can be constructed as DCT atoms with additional phase information. Due to the additional parameter, there is possibly a better fit to the signal, making the coefficients sparser. For some $0 \le j < N$, $0 \le n < N$ and $\varphi \in (0, 2\pi)$ we define the Gabor atoms.

Now we turn to the problem of audio inpainting, i.e. we want to reconstruct the missing samples in an audio signal. The problem we have to solve is: given the reliable samples $y_r = M_r y$, reconstruct the full signal $y$ under the assumption that $y$ can be represented by a sparse set of coefficients. Incorporating this information, we formulate an optimization problem. Solving this NP-hard non-convex optimization problem is not feasible; therefore we use orthogonal matching pursuit (OMP), a greedy algorithm, instead [8]. The steps are described in detail in Algorithm 1, as it is not only used for audio inpainting but also plays a major role in the K-SVD algorithm presented in Section III. This algorithm always yields a coefficient vector with $\|c\|_0 \le s$ for some user-determined sparsity parameter $s$.
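The greedy procedure can be sketched as follows in Python/NumPy. This is a generic textbook OMP and not necessarily a step-for-step copy of Algorithm 1; it assumes that the inpainting problem amounts to fitting $y_r$ with at most $s$ columns of $M_r D$, and the function names `omp` and `inpaint` as well as the tolerance `tol` are hypothetical.

```python
import numpy as np

def omp(A, b, s, tol=1e-10):
    """Approximate b with at most s columns of A (greedy OMP)."""
    norms = np.maximum(np.linalg.norm(A, axis=0), 1e-12)
    support = []
    coef = np.zeros(0)
    residual = b.copy()
    while len(support) < s and np.linalg.norm(residual) > tol:
        # Correlate the residual with every (normalized) column.
        corr = np.abs(A.T @ residual) / norms
        if support:
            corr[support] = 0.0          # never pick an atom twice
        support.append(int(np.argmax(corr)))
        # Least-squares fit on the current support, then update the residual.
        coef, *_ = np.linalg.lstsq(A[:, support], b, rcond=None)
        residual = b - A[:, support] @ coef
    c = np.zeros(A.shape[1])
    if support:
        c[support] = coef
    return c

def inpaint(y, missing, D, s):
    """Fill the samples listed in `missing` using a sparse fit of the rest."""
    reliable = np.setdiff1d(np.arange(len(y)), missing)
    c = omp(D[reliable, :], y[reliable], s)   # rows of D = M_r D, y_r = M_r y
    y_hat = y.copy()
    y_hat[missing] = (D @ c)[missing]         # synthesize only the gap
    return y_hat
```

Restricting the rows of `D` to the reliable indices plays the role of multiplying by $M_r$; the reliable samples themselves are left untouched and only the gap is synthesized from $Dc$.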
Due to the length of the signals under investigation, one usually processes the data segment-wise. Therefore, the steps explained above have to be executed for every time slice, and the signal is then re-synthesized from these individually obtained small portions. While fast transforms exist for some pre-specified dictionaries (such as Gabor and DCT), it would not be feasible to perform the steps described above for the dictionaries that we will construct in Section III.

III. DICTIONARY LEARNING VIA K-SVD

A. Overview
Static prespecified dictionaries such as the DCT or Gabor dictionaries introduced above are efficient because the transform can be computed in a fast way. They are usually tailored to a specific group of signals. In the following we discuss how to flexibly construct a dictionary adapted to the signal, allowing for sparser representations.
Algorithm 1 (OMP), intermediate steps: choose the index $j$ with maximal absolute value; add the $j$-th entry of $D_{\Omega_i}^{+} y$ to the $j$-th entry of $c_k$.

This process has two main stages. The first is to build a training set of signals from reliable segments of the input audio data; these portions of samples are selected from the input signal according to user specifications. In the second stage, an adapted dictionary is learned from the training data. Designing a new adapted dictionary brings an additional computational burden, because each dictionary atom has to be compared with the training data.
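The first stage, collecting training data from reliable parts of the signal, might look like the following Python/NumPy sketch. The function name `training_segments`, the boolean `reliable_mask` and the rule of keeping only fully reliable segments are assumptions for illustration; the toolboxes used in this paper may select segments differently.

```python
import numpy as np

def training_segments(signal, reliable_mask, seg_len, shift):
    """Collect fully reliable segments as the columns of a training matrix.

    signal        : 1-D array of audio samples
    reliable_mask : 1-D boolean array, True where a sample is reliable
    seg_len       : segment length in samples
    shift         : distance in samples between consecutive segment starts
    """
    columns = []
    for start in range(0, len(signal) - seg_len + 1, shift):
        stop = start + seg_len
        # Keep the segment only if it contains no missing sample.
        if reliable_mask[start:stop].all():
            columns.append(signal[start:stop])
    if not columns:
        return np.empty((seg_len, 0))
    return np.stack(columns, axis=1)
```

Each column of the returned matrix is one training signal; a smaller `shift` yields more, but more strongly overlapping, columns.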

B. Brief history of Dictionary Learning
The idea of dictionary learning was first introduced in 1996 by Olshausen; this method was called the Maximum Likelihood Method [9]. Another method, called the Method of Optimal Directions (MOD), was introduced in 2000 by Engan et al. [10]. Another approach, using Maximum A-Posteriori Probability, was introduced by Engan (1999) [11] and Murray (2001) [12]. A few years later, Lesage et al. presented unions of orthonormal bases [13].

C. The K-SVD algorithm
This algorithm was first presented by Aharon et al. [6]. It is inspired by the K-means algorithm [14], which solves the vector quantization problem. Vector quantization is a process in which training examples are assigned to their nearest neighbors, each example is represented by just one coefficient and, given the coefficients, the atoms of the dictionary $D$ are constructed. There is an obvious relation between sparse representations and quantization: vector quantization is an extreme sparse representation in which only one atom is allowed in the signal decomposition and the coefficient values are restricted to 0 or 1.
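The relation to sparse coding can be illustrated with one K-means step written in dictionary notation. In this Python/NumPy sketch (function names `vq_code` and `kmeans_update` are hypothetical) every training example receives a coefficient vector with a single entry equal to 1, and each atom is then recomputed as the mean of the examples assigned to it.

```python
import numpy as np

def vq_code(Y, D):
    """Assign each column of Y to its nearest atom; return one-hot coefficients."""
    # Squared distances, shape (number of atoms, number of examples).
    d2 = ((Y[:, None, :] - D[:, :, None]) ** 2).sum(axis=0)
    nearest = d2.argmin(axis=0)
    C = np.zeros((D.shape[1], Y.shape[1]))
    C[nearest, np.arange(Y.shape[1])] = 1.0   # exactly one coefficient per example
    return C

def kmeans_update(Y, C, D):
    """Replace every used atom by the mean of the examples assigned to it."""
    D_new = D.copy()
    for k in range(D.shape[1]):
        members = C[k] == 1.0
        if members.any():
            D_new[:, k] = Y[:, members].mean(axis=1)
    return D_new
```

K-SVD relaxes both restrictions: several atoms may be used per example and the coefficients are arbitrary real numbers.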
The predecessor of K-SVD and the algorithm closest to it is MOD (Method of Optimal Directions), which updates the whole dictionary at once in each learning iteration. The advantage of K-SVD is that it updates just one vector (atom) in each update step, and the coefficients corresponding to this atom are updated at the same time; therefore the convergence is accelerated.
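For comparison, the whole-dictionary update of MOD can be written in one line. The sketch below (Python/NumPy, hypothetical function name `mod_update`) uses the standard least-squares form $D = Y C^{+}$ for fixed coefficients $C$; it is given only to contrast with the atom-by-atom update of K-SVD and omits the column renormalization that usually follows.

```python
import numpy as np

def mod_update(Y, C):
    """MOD dictionary update: least-squares solution of min_D ||Y - D C||_F."""
    return Y @ np.linalg.pinv(C)
```

All atoms change simultaneously here, while the coefficients stay fixed until the next sparse coding stage.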
The goal of this algorithm is to adapt a dictionary $D$ so that it represents the input signals $y_k$ more sparsely, using any pursuit algorithm that approximates the solution of the optimization problem

$$\hat{c}_k = \arg\min_{c} \|y_k - Dc\|_2^2 \quad \text{subject to} \quad \|c\|_0 \le S_{\max}.$$

In this case we use the OMP algorithm, which is described in Section II. The K-SVD algorithm is described below (Algorithm 2); details can be found in [6], [8].
Algorithm 2 (K-SVD), recoverable steps:
Parameter: $S_{\max}$, the maximal sparsity of the vectors $c_i$.
Repeat until convergence (stopping rule):
Sparse coding
1: Solve, using any pursuit algorithm, the minimization problem for the coefficient vectors $c_i$.
Dictionary update
For each atom $k = 1, 2, \dots, K$ in $D^{(J-1)}$, update by:
2: Set the group of indices using the updated atom.

Every time the dictionary $D$ is modified, it has to be checked whether it is $\ell_2$-normalized.
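The dictionary update for a single atom can be sketched as follows, following the rank-one SVD update described by Aharon et al. [6]. This Python/NumPy code is an illustration under stated assumptions, not the SMALLbox implementation; the function name `ksvd_update_atom` is hypothetical and the arrays `D` and `C` are modified in place.

```python
import numpy as np

def ksvd_update_atom(Y, D, C, k):
    """Update atom k of D and row k of C in place (one K-SVD update step).

    Y : training signals as columns, shape (N, P)
    D : dictionary with unit-norm columns, shape (N, K)
    C : sparse coefficients, shape (K, P)
    """
    users = np.nonzero(C[k, :])[0]          # signals that use atom k
    if users.size == 0:
        return D, C                         # unused atom: leave it unchanged
    C[k, users] = 0.0                       # remove atom k's contribution
    E = Y[:, users] - D @ C[:, users]       # error without atom k
    U, S, Vt = np.linalg.svd(E, full_matrices=False)
    D[:, k] = U[:, 0]                       # new atom, unit l2 norm by construction
    C[k, users] = S[0] * Vt[0, :]           # matching coefficients
    return D, C
```

Because only the columns in `users` are touched, the sparsity pattern of `C` is preserved, and taking the first left singular vector keeps the updated atom normalized.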
There is no guarantee that the K-SVD algorithm reaches a global or even a local minimum. The purpose of this paper is to find approximate basic parameters for the convergence of dictionary learning on audio signals using the K-SVD algorithm. However, the convergence cannot be checked inside the algorithm; it can only be checked externally by comparing the results with another experiment.

IV. EXPERIMENTAL RESULTS

A. Optimizing K-SVD parameters
Audio inpainting as presented in [5] was performed only with static dictionaries. That was the motivation for using the K-SVD algorithm to adapt the dictionary to the observed signal and thereby improve the reconstruction of missing samples. The software utilized for our experiments is specified in Section V. The Audio Inpainting Toolbox also contains test wave files with speech (sampling frequency $f_s$ = 8 kHz and 16 kHz) and music ($f_s$ = 16 kHz). We carried out several tests to obtain optimal parameters for dictionary learning, and with these parameters we compared our approach with static dictionaries. These tests were performed on the one-channel audio file music07_16kHz.wav with sampling frequency $f_s$ = 16 kHz and a length of 5 seconds.
After each iteration of dictionary learning via the K-SVD algorithm, the Root Mean Square Error (RMSE) is computed. After a few iterations the RMSE settles at some value and remains unchanged. Fig. 1 shows that four iterations are enough to reach a satisfactory RMSE value and that after about 10 iterations the RMSE stabilizes at its minimum. Because the lowest RMSE [...]

Another experiment focused on minimizing the RMSE with respect to the shift between the segments taken from the reliable samples to form the training data. If the audio file is short and does not provide enough training segments, one has to decide between a smaller segment shift giving more training data and a larger segment shift giving less training data. However, decreasing the segment shift only enlarges the amount of training data artificially, because samples are repeated across the training segments. The results are presented in Fig. 2. Using the audio file mentioned above, with a length of 80 000 samples, a segment length of 256 samples and a redundancy factor of 3, the dictionary $D$ has a size of 256 × 768. With these parameters the segment shift can be chosen from the interval (1; 100). Increasing the segment shift leads to a smaller RMSE during the dictionary learning process.
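The two quantities discussed here can be sketched in Python/NumPy as follows. Both helpers are assumptions for illustration: `rmse` uses the common definition of the root of the mean squared entry of $Y - DC$, which may differ from the exact formula used in the experiments, and `num_training_segments` simply counts segment start positions, assuming every segment of the file is reliable.

```python
import numpy as np

def rmse(Y, D, C):
    """Root mean square error of the approximation Y ~ D C (assumed definition)."""
    return float(np.sqrt(np.mean((Y - D @ C) ** 2)))

def num_training_segments(signal_len, seg_len, shift):
    """Number of length-seg_len segments whose starts are `shift` samples apart."""
    if signal_len < seg_len:
        return 0
    return (signal_len - seg_len) // shift + 1

# Values for the file described above: 80 000 samples, 256-sample segments.
segments_shift_1 = num_training_segments(80_000, 256, 1)      # 79745
segments_shift_100 = num_training_segments(80_000, 256, 100)  # 798
```

Under these assumptions, a shift of 100 samples still leaves slightly more training segments than the 768 atoms of the dictionary.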
One of the dictionary learning parameters explored further is the maximum number of nonzero coefficients $S_{\max}$. Dictionary learning experiments were carried out for $S_{\max} \in \{1, 2, 3, 4, 5\}$ to find the value giving the lowest RMSE. Figure 3 shows that after six iterations the minimal RMSE is reached with $S_{\max} = 3$ and stays minimal with very little change. The number of iterations has to be chosen deliberately: if it is too small, the RMSE remains high (the dictionary is not adapted as much as it could be); if it is too high, the algorithm wastes time on further iterations after the minimum RMSE has been reached or, worse, the RMSE can go up. That is why another experiment observing the RMSE was performed with the best parameters obtained above. Figure 4 shows that a satisfactory RMSE can be obtained with three or four iterations. This test was done for numbers of iterations from the interval (1; 200). The number of iterations needed for the RMSE to settle can differ between signals; in our experiments we used 50 iterations.
After settling at its minimum, the RMSE values oscillated; therefore, for all of the plots above, the MATLAB Curve Fitting Toolbox was used to draw a trend line (approximation by a quadratic function).

B. Dictionary comparison on real signals inpainting
We now compare the audio inpainting results for different sound files and various dictionaries. Both static (DCT and Gabor) and trained versions of these dictionaries were used to compare reconstruction results. The redundancy of all the dictionaries is 3. The parameters of the K-SVD dictionary learning algorithm are summarized in Table I. The initial dictionary for K-SVD learning was filled with random values, because this gave the most satisfactory dictionary learning process in our experiments. In each audio file a gap (a sequence of samples with zero value) of 1 to 240 samples is made, and the signal reconstruction is evaluated by the Signal-to-Noise Ratio (SNR) computed only on the missing samples.

Our first experiments were made with a male (male04_16kHz.wav) and a female voice (female04_16kHz.wav) speaking English. The gap was generated starting at the 6 000th sample. Figure 5 shows the reconstruction results for the female voice. The best SNR values were obtained using the Gabor dictionary, and the trained dictionary (which is supposed to approximate the input signal better) is in some cases worse than the static dictionaries.
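An SNR restricted to the gap can be computed as in the following Python/NumPy sketch. The formula is the common definition $10 \log_{10}\left(\|y_m\|_2^2 / \|y_m - \hat{y}_m\|_2^2\right)$ evaluated on the missing samples only; it is assumed here and may differ in detail from the expression used in the experiments. The function name `snr_missing` is hypothetical.

```python
import numpy as np

def snr_missing(y, y_hat, missing):
    """SNR in dB between original y and reconstruction y_hat on the gap only."""
    reference = y[missing]
    error = reference - y_hat[missing]
    return float(10.0 * np.log10(np.sum(reference ** 2) / np.sum(error ** 2)))
```

Samples outside the gap do not influence the result, so a reconstruction is judged only on what it actually had to estimate.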
Figure 6 shows the reconstruction results for male speech. The results of the different dictionaries are almost the same, but from a gap length of 160 samples onwards the trained dictionary outperforms the static dictionaries.
Further experiments were done on audio files containing music. The gap was generated starting at the 33 000th sample. The first music file (music06_16kHz.wav) contains a singing female voice, and the reconstruction error is shown in Figure 7. Here, the Gabor and trained dictionaries outperform the DCT dictionary, and the K-SVD trained dictionary appears to be more stable for larger gaps.
The file music07_16kHz.wav contains a recording of drums. Figure 8 shows that all the dictionaries produce more or less the same reconstruction results in terms of SNR.
The last experiment was performed with a music sample of guitar playing (music11_16kHz.wav). Figure 9 shows that for gap lengths from 40 to 110 samples the K-SVD trained dictionary outperforms the static dictionaries by about 10 dB. The reconstruction results of the static dictionaries are almost the same.

V. SOFTWARE
For the Audio Inpainting process we used the Audio Inpainting Toolbox (http://small-project.eu/keyresults/audio-inpainting), and the dictionary learning was done with SMALLbox v. 1.9 (http://small-project.eu/software-data/smallbox/). These toolboxes are freely downloadable from the given links. The source m-files for reproducing our experiments are built on functions from the Audio Inpainting Toolbox and can be downloaded from [15].
The files were created in MATLAB and can be run with it. Note that results obtained with a random dictionary may differ between runs, because every call of the randn() function in MATLAB returns a new matrix.
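One way to make such runs repeatable is to fix the state of the random number generator before drawing the initial dictionary. The sketch below shows the idea in Python/NumPy rather than MATLAB; the function name `random_dictionary` and the seed value are hypothetical, and the 256 × 768 size is the one used in Section IV.

```python
import numpy as np

def random_dictionary(n_rows, n_atoms, seed):
    """Draw a Gaussian random dictionary with unit-norm columns, reproducibly."""
    rng = np.random.default_rng(seed)          # fixed seed -> same matrix every run
    D = rng.standard_normal((n_rows, n_atoms))
    return D / np.linalg.norm(D, axis=0)       # l2-normalize every atom

D_init = random_dictionary(256, 768, seed=1)
```

With the seed recorded alongside the other parameters, two runs start from the same initial dictionary and differences in the results can be attributed to the parameters under study.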

VI. CONCLUSION
In this paper, we presented a connection of two techniques to improve the reconstruction of missing audio signal information. The Audio Inpainting problem was solved by the Orthogonal Matching Pursuit algorithm, and the dictionary was adapted by the K-SVD algorithm. Both of them are described in the text. The adapted dictionary was compared with static DCT and Gabor dictionaries.
Our aim was to find optimal parameters for dictionary learning via K-SVD and to compare the results with static dictionaries. The construction of the building blocks of the dictionary is signal dependent, and there are no general rules for setting up the parameters. In most cases the trained dictionary outperforms the static dictionaries, but the higher computational load has to be taken into account. However, there are cases where the reconstruction error of the trained dictionary was the worst among the three dictionaries. We have done several tests using real signals, which are presented in the paper.
Future work will focus on the construction of the dictionary atoms and on moving this problem to the time-frequency plane. The knowledge presented in this paper will be utilized for solving the real problem of old traditional music recordings, which contain signal gaps surrounded by a high noise level.

Algorithm 1 (OMP), final steps:
8: $R = y - D_{\Omega_i} c_k$
9: end while
Output: $c$, the sparse coefficients approximating $y$

Fig. 1. RMSE according to number of iterations using different initial dictionaries.

Fig. 2. RMSE according to shift of the original signal segmentation for training data.

Fig. 3. RMSE according to max. number of non-zero coefficients during the dictionary learning (legend: $S_{\max}$ = 1, 2, 3, 4, 5).