Information Systems and

Machine Learning Lab.

Stiftung Universität Hildesheim

Marienburger Platz 22

31141 Hildesheim

Hadi Samer Jomma

Supervisor

Piano Transcription With Convolutional Sparse Lateral Inhibition

Abdullah Al Foysal

277458, [email protected]

Abstract

Automatic piano transcription is the process of creating a musical score (i.e., a symbolic representation) from an audio recording. In this paper, the authors extend their previous work on automatic transcription of piano music in a context-dependent setting. In that work, they applied convolutional sparse coding to approximate the music waveform as the summation of piano note waveforms (dictionary elements) convolved with their temporal activations (onset transcription). The piano note waveforms are pre-recorded for the specific piano to be transcribed in the specific environment. In the present work, a dictionary containing multiple waveforms per pitch is generated by truncating a long waveform for each pitch to different lengths. In this approach, a sparsity penalty promotes globally sparse activations of the dictionary elements, and a lateral inhibition term penalizes concurrent activations of different waveforms corresponding to the same pitch within a temporal neighborhood, in order to achieve note length estimation.

Contents

1. Introduction / Motivation

2. Structured Sparsity

3. Proposed Method

4. Algorithm

5. Comparison

5.1 Figures

5.2 Tables

6. Experiments

7. Discussion and Conclusion

References

1. Introduction / Motivation

In music parlance, a transcription is a piece of music that has been re-scored for a different instrument or instruments than the original; a piece originally written for an orchestra, for example, could be transcribed for solo piano. In this paper, however, automatic music transcription (AMT) refers to the process of converting an audio recording into a symbolic representation such as a musical score. AMT has applications such as providing feedback to a piano learner, content-based music search, visualizing the music content, and musicological analysis of non-notated music. AMT systems that aim to achieve note-level transcription must estimate three basic parameters of a musical note: pitch, onset, and offset. Pitch describes how high or low a sound is; onset refers to the beginning of a musical note, and offset to its end. There has been quite a lot of work on music transcription, but most of the research has focused on pitch and onset detection rather than offset detection [1]. Many note-level music transcription methods are frame based: they attempt to identify pitches in each time frame and then determine note onsets and offsets through post-processing [1]. Many of the most popular approaches in this category are based on spectrogram decomposition and use either nonnegative matrix factorization (NMF) [2], [3] or probabilistic latent component analysis (PLCA) [4]. A post-processing step, such as a median filter or a hidden Markov model (HMM), is then required to estimate note onsets and offsets from the frame-level pitch estimates [5]. Other frame-based methods include deep neural networks [6]–[9] and probabilistic methods [10]–[14]. In both the waveform and the spectral content, piano notes are characterized by significant temporal evolution. In particular, different partials decay at different rates: higher-frequency partials decay faster than lower-frequency ones [15]–[17].
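As a concrete illustration of the spectrogram-decomposition idea mentioned above, the following is a minimal NMF sketch using Lee and Seung's multiplicative updates [2], [3]. The toy "spectrogram" and all variable names are illustrative, not taken from the paper.

```python
import numpy as np

def nmf(V, rank, n_iter=200, eps=1e-9):
    """Factor a nonnegative spectrogram V (freq x frames) as W @ H using
    Lee & Seung's multiplicative updates for the squared-error objective."""
    rng = np.random.default_rng(0)
    n_freq, n_frames = V.shape
    W = rng.random((n_freq, rank)) + eps    # spectral templates (one per pitch)
    H = rng.random((rank, n_frames)) + eps  # per-frame activations
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# toy "spectrogram": two fixed spectra switching on and off over time
templates = np.array([[1.0, 0.0], [0.5, 0.2], [0.0, 1.0]])
activations = np.array([[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]])
V = templates @ activations
W, H = nmf(V, rank=2)
print(np.linalg.norm(V - W @ H))  # reconstruction error becomes small
```

Frame-based methods of this kind still need a post-processing step (median filter, HMM) to turn the rows of H into note onsets and offsets, which is exactly the limitation the paper's time-domain approach avoids.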

In [18], the authors proposed a generalization of PLCA to account for the temporal evolution of each note. Mysore et al. [19] introduced a variant of NMF called nonnegative factorial HMM (N-FHMM) to learn multiple spectral templates for each note together with a Markov chain describing the temporal transitions between them. Ewert et al. [20] recently proposed a dynamic programming variant of N-FHMM to reduce its high computational cost. This method has been extended and adapted to piano music transcription by Cheng et al. [21]. Nonnegative matrix deconvolution (NMD), as introduced in [22], is also capable of modeling the temporal evolution of nonstationary sounds. All these methods can estimate the note length, but they are generally evaluated on onset-only estimation [23]. Even the most recent MIREX contest shows that most algorithms cannot achieve good results in both onset detection and length estimation (see MIREX 2016 [24]).

In [25] and [26], the authors proposed a time-domain approach, referred to as CDW-15 in the following, to address piano music transcription in a context-dependent setting. CDW-15 approximates the music signal s with the summation of note waveforms {d_m} convolved with their temporal activation coefficients {x_m}:

min_{x_m} (1/2) ||s − Σ_m d_m ∗ x_m||₂² + λ Σ_m ||x_m||₁,    (1)

where λ is a regularization constant. The waveform d_m of each individual pitch m is pre-recorded on the same piano and in the same environment as the music signal to be transcribed, and its length is always truncated to 1 s. The l1 regularization term encourages sparse activations of notes; higher values of λ result in sparser activations. This approach achieves higher accuracy in pitch and onset estimation than the state of the art [5], but it does not estimate note offsets.
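The objective in equation (1) can be evaluated directly in the time domain. The sketch below is illustrative only: `csc_objective` and the toy atom are hypothetical names and data, not the authors' code.

```python
import numpy as np

def csc_objective(s, dictionary, activations, lam):
    """Eq. (1): 0.5 * ||s - sum_m d_m * x_m||_2^2 + lam * sum_m ||x_m||_1,
    where * is linear convolution truncated to the signal length."""
    approx = np.zeros_like(s)
    for d_m, x_m in zip(dictionary, activations):
        approx += np.convolve(x_m, d_m)[: len(s)]
    residual = s - approx
    return 0.5 * residual @ residual + lam * sum(np.abs(x).sum() for x in activations)

# toy check: one decaying "note" atom activated at sample 0 reconstructs s exactly
d = np.array([1.0, 0.5, 0.25])
x = np.zeros(5)
x[0] = 1.0
s = np.convolve(x, d)[:5]
print(csc_objective(s, [d], [x], lam=0.0))  # perfect fit, no penalty -> 0.0
```

With lam > 0 the same perfect fit incurs a penalty proportional to the total activation magnitude, which is what drives the sparse solutions.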

Here the authors extend CDW-15 to estimate the note length by using a dictionary containing multiple atoms of different lengths per pitch, thus creating pitch groups of atoms corresponding to the same pitch. When multiple atoms per pitch are used, concurrent activations of several atoms in the same pitch group must be avoided. To achieve this, the authors impose a lateral inhibition [27] regularization term on the activation coefficients of atoms in the same pitch group, in addition to the l1 regularization on all atoms. The lateral inhibition regularization prevents concurrent activation of multiple atoms in the same pitch group within a temporal neighborhood. This property is called within-group sparsity.

2. Structured Sparsity

Structured sparsity has recently emerged in statistics, machine learning, and signal processing as a promising paradigm for learning in high-dimensional settings. Standard sparsity assumes a representation that has only a few nonzero coefficients, but makes no additional assumptions on how these nonzero coefficients are distributed within the coefficient vector or matrix. Structured sparsity assumes that there is some identifiable structure to the distribution of these coefficients. This structure can take many forms, but joint sparsity and group sparsity [28] are the most common. Both types of structure can be promoted by the use of the l2,1 norm [28].
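A small numerical illustration of how the l2,1 norm favors group-structured coefficients. The row-per-group convention used here is one common choice; the arrays are made up for the example.

```python
import numpy as np

def l21_norm(X):
    """l2,1 norm: sum of the l2 norms of the rows of X (one row per group).
    Penalizing it drives entire rows (groups) to zero, i.e., group sparsity."""
    return np.sqrt((X ** 2).sum(axis=1)).sum()

dense = np.array([[1.0, 1.0], [1.0, 1.0]])                  # energy spread over both groups
grouped = np.array([[np.sqrt(2), np.sqrt(2)], [0.0, 0.0]])  # same total l2 energy, one group
print(l21_norm(dense), l21_norm(grouped))  # the grouped arrangement scores lower
```

Both matrices have the same Frobenius norm, but concentrating the energy in one group lowers the l2,1 norm, which is why minimizing it encourages whole groups to switch off.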

Structured sparsity has previously been applied to AMT. In an NMF framework, a dictionary with multiple atoms per pitch can be learned, in which each atom in the same group represents a different frame of a long note of a particular pitch; group sparsity is then introduced to promote contiguous activation of multiple atoms in the same group. O’Hanlon et al. [29] introduced an example of such structured sparsity, using a modified nonnegative basis-pursuit greedy approach. O’Hanlon and Plumbley [30] proposed a group sparsity approach in an NMF framework to promote the co-activation of harmonically related narrow-band atoms. Here each group still represents a single pitch, but each pitch is sliced harmonically, not temporally as in the previous method. In the present work, the authors use group sparsity to limit the number of concurrently active atoms inside each group, as each atom represents a full note. Global sparsity on the activations must also be promoted, because group sparsity alone is not sufficient to achieve a good transcription.

3. Proposed Method

The main idea of the proposed method is to jointly estimate the pitch, onset, and duration of notes by using a dictionary containing multiple atoms of different lengths for each pitch in the convolutional sparse coding (CSC) framework of equation (1). The authors truncate a 1-s-long template per pitch, trained as in CDW-15 [26], to different lengths to create the dictionary. This approach is faster and easier than sampling the same pitch played with different durations.
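A sketch of building one pitch group by truncation, using a synthetic decaying-sinusoid stand-in for the trained template. The short fade-out at the cut is my addition to avoid a click at the truncation point; the paper only mentions truncation.

```python
import numpy as np

def build_pitch_group(template, lengths, fade=32):
    """Create multiple atoms for one pitch by truncating a long note template
    to different lengths. The linear fade-out is an assumption, not from the
    paper, added to avoid a discontinuity at the cut."""
    atoms = []
    for n in lengths:
        atom = template[:n].copy()
        k = min(fade, n)
        atom[-k:] *= np.linspace(1.0, 0.0, k)  # taper the truncated tail
        atoms.append(atom)
    return atoms

fs = 8000
t = np.arange(fs) / fs                                # 1-s template, as in CDW-15
template = np.exp(-3 * t) * np.sin(2 * np.pi * 440 * t)
group = build_pitch_group(template, lengths=[fs // 8, fs // 4, fs // 2, fs])
print([len(a) for a in group])  # one atom per candidate note length
```

Each atom in the resulting list shares a pitch group; the lateral inhibition term introduced next is what keeps more than one of them from activating at once.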

From Figure 1(a) we can see that at the beginning of the signal, and slightly after t = 1.5 s, multiple templates in the same pitch group are activated concurrently; as a result, simply expanding the dictionary does not lead to satisfying results. The reason is that the l1 regularization in equation (1) only promotes sparse activations of all templates across all times, but does not distinguish activations of templates in the same pitch group from activations of templates in different pitch groups. It is unlikely that the same note is played repeatedly too quickly, but it is possible for a player to play different notes at a rapid pace [26]. We therefore need a regularization term that distinguishes these activations and penalizes close activations of templates in the same pitch group.

The authors proposed to use a lateral inhibition [27] regularization term on the activations of templates in the same pitch group within a temporal window. The cost of activating atom m at time t is given by

|x_m(t)| ( Σ_{n ∈ G(m)} Σ_{|t′ − t| ≤ T/2} |x_n(t′)| − |x_m(t)| ),    (2)

where G(m) is the pitch group to which atom m belongs and T is the length of the temporal window of inhibition. The activation of atom m at time t inhibits the activation of all other atoms in the same pitch group within the temporal window around t. The term |x_m(t)| is subtracted from the summation to avoid self-inhibition.
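A direct, unoptimized transcription of this cost for a discrete activation matrix. This is a sketch; the boundary handling of the window is an assumption.

```python
import numpy as np

def inhibition_cost(X, groups, T):
    """Lateral-inhibition penalty on an activation matrix X (atoms x time).
    Each activation is weighted by same-pitch-group activity within a window
    of radius T/2 around it, excluding the activation itself."""
    n_atoms, n_time = X.shape
    A = np.abs(X)
    cost = 0.0
    for m in range(n_atoms):
        same = [n for n in range(n_atoms) if groups[n] == groups[m]]
        for t in range(n_time):
            lo, hi = max(0, t - T // 2), min(n_time, t + T // 2 + 1)
            neighborhood = A[same, lo:hi].sum() - A[m, t]  # subtract self
            cost += A[m, t] * neighborhood
    return cost

X = np.zeros((2, 10))
X[0, 2] = 1.0
X[1, 3] = 1.0  # one sample later
print(inhibition_cost(X, groups=[0, 0], T=4))  # same pitch group: penalized
print(inhibition_cost(X, groups=[0, 1], T=4))  # different groups: no penalty
```

Concurrent activations in the same group within the window accrue cost, while the same pattern across different pitch groups is free, which is exactly the asymmetry the l1 term alone cannot express.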

The full regularization term is obtained by summing this cost over all atoms and all time instants, multiplied by a constant γ. The objective function becomes

min_{x_m} (1/2) ||s − Σ_m d_m ∗ x_m||₂² + γ Σ_m Σ_t |x_m(t)| ( Σ_{n ∈ G(m)} Σ_{|t′ − t| ≤ T/2} |x_n(t′)| − |x_m(t)| ).    (3)

From Figure 1(b) we can see that this objective function minimizes concurrent activations of atoms in the same pitch group inside the inhibition time window (50 ms), but the activations are not globally sparse over time. Global sparsity is a main component of CDW-15, and it has been successfully applied to AMT for a long time [1]. The authors therefore added a global l1 norm to the basic lateral inhibition model in equation (3) in order to promote global sparsity on the activations of all templates. The objective function with both the global l1-norm regularization and the lateral inhibition regularization is

min_{x_m} (1/2) ||s − Σ_m d_m ∗ x_m||₂² + λ Σ_m ||x_m||₁ + γ Σ_m Σ_t |x_m(t)| ( Σ_{n ∈ G(m)} Σ_{|t′ − t| ≤ T/2} |x_n(t′)| − |x_m(t)| ).    (4)

From Figure 1(c), we can see that with this regularization the activation vectors are now sparser and less noisy, and also globally sparse, as will be shown in the experimental section.

4. Algorithm

The lateral inhibition structured sparse coding problem [28] has the form

arg min_x (1/2) ||Dx − s||₂² + |x|ᵀ Ω |x|,    (5)

where D is a dictionary matrix and Ω is a matrix encoding the pattern of desired mutual inhibitions. As noted in [27], if the entries of Ω are nonnegative we can define w = |x|ᵀ Ω and write equation (5) as a weighted basis pursuit denoising (BPDN) problem

arg min_x (1/2) ||Dx − s||₂² + ||w ⊙ x||₁,    (6)

where ⊙ is the Hadamard product. This allows the problem to be tackled by modifying a standard algorithm for the BPDN problem. In [27] the authors reported that good performance was obtained with the fast iterative shrinkage-thresholding algorithm (FISTA).
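A toy illustration of the reweighting step w = |x|ᵀ Ω and the resulting weighted l1 penalty ||w ⊙ x||₁. The 3-atom inhibition pattern Ω and the vector x are made up for the example.

```python
import numpy as np

# Toy inhibition pattern Omega: atoms 0 and 1 share a pitch group and inhibit
# each other; atom 2 stands alone. Values are illustrative only.
Omega = np.array([[0.0, 1.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 0.0, 0.0]])
x = np.array([0.8, 0.5, 0.3])

w = np.abs(x) @ Omega              # w = |x|^T Omega: per-atom inhibition weight
weighted_l1 = np.abs(w * x).sum()  # ||w (.) x||_1, the weighted BPDN penalty
print(w, weighted_l1)
```

Note how each atom's weight is the magnitude of its group partner's activation: atom 2, which inhibits no one and is inhibited by no one, carries zero weight and is left untouched by the penalty.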

The innovation with respect to that algorithm is twofold. First, the authors modified the regularization term so that the lateral inhibition is specified by the product of a convolution filter, which determines the inhibition in time, and a small matrix, which determines the inhibition within and between groups of dictionary atoms. This was necessary because they wished to apply the model to a signal s that is far too large for an explicit weighting matrix Ω to be practical. Second, since the alternating direction method of multipliers (ADMM) has been shown to be more effective than FISTA for the convolutional BPDN (CBPDN) problem [32], they modified the ADMM algorithm proposed in [31] to include the necessary iterative reweighting.
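The authors' algorithm is a frequency-domain convolutional ADMM, which is too long to reproduce here. As a much simplified stand-in that still shows the iterative-reweighting idea, here is an ISTA-style loop for the non-convolutional weighted problem, with the weight vector recomputed at every iteration. All names, the toy dictionary, and the use of ISTA instead of ADMM are my assumptions, not the authors' implementation.

```python
import numpy as np

def reweighted_ista(D, s, Omega, lam, gamma, n_iter=300):
    """Simplified stand-in for the reweighted ADMM: proximal-gradient (ISTA)
    steps on 0.5*||Dx - s||^2 with a weighted l1 penalty whose weights
    w = lam + gamma * |x|^T Omega are recomputed at every iteration."""
    x = np.zeros(D.shape[1])
    L = np.linalg.norm(D, 2) ** 2  # Lipschitz constant of the gradient
    for _ in range(n_iter):
        w = lam + gamma * (np.abs(x) @ Omega)          # per-atom thresholds
        z = x - D.T @ (D @ x - s) / L                  # gradient step
        x = np.sign(z) * np.maximum(np.abs(z) - w / L, 0.0)  # weighted shrinkage
    return x

# Two correlated atoms that both partly explain the signal; mutual inhibition
# should keep only one of them active.
D = np.array([[1.0, 0.8],
              [0.0, 0.6]])          # columns are the atoms
s = np.array([1.0, 0.0])            # the signal equals atom 0
Omega = np.array([[0.0, 1.0],
                  [1.0, 0.0]])      # the two atoms inhibit each other
x = reweighted_ista(D, s, Omega, lam=0.01, gamma=1.0)
print(x)  # atom 0 stays active, atom 1 is shut off
```

Once one atom's activation grows, the other atom's threshold grows with it and its coefficient is shrunk to zero, which is the within-group sparsity effect the weighting is meant to produce.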

The lateral inhibition regularization term in equations (3) and (4) is rewritten in terms of convolution as

Σ_m Σ_t |x_m(t)| ( Σ_n c_{m,n} (h ∗ |x_n|)(t) − |x_m(t)| ),    (7)

where h is the time inhibition window, equal to 1 around the origin within a radius of T/2 and 0 elsewhere, and c_{m,n} is defined as

c_{m,n} = 1 if G(m) = G(n), and 0 otherwise,    (8)

where G(m) and G(n) are the pitch groups to which atoms m and n belong.
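The convolutional form of the inhibition weights can be computed with an ordinary box-filter convolution per atom. This is a sketch; the boundary handling via `mode="same"` is an assumption.

```python
import numpy as np

def inhibition_weights(X, groups, T):
    """w_m(t) = sum_n c_{m,n} (h * |x_n|)(t) - |x_m(t)|, where h is a box
    window of radius T/2 and c_{m,n} = 1 iff atoms m, n share a pitch group."""
    A = np.abs(X)
    h = np.ones(2 * (T // 2) + 1)  # box window, radius T/2
    smoothed = np.array([np.convolve(a, h, mode="same") for a in A])
    C = np.equal.outer(groups, groups).astype(float)  # c_{m,n} group indicator
    return C @ smoothed - A        # subtract self-activation

X = np.zeros((2, 8))
X[0, 3] = 1.0
X[1, 4] = 2.0  # same pitch group as atom 0
W = inhibition_weights(X, groups=np.array([0, 0]), T=4)
print(W[0, 3], W[1, 4])  # each atom is weighted by the other's nearby activity
```

Computing the weights this way needs only one convolution per atom and a small group-indicator matrix, which is what makes the approach practical for long signals where an explicit Ω would be enormous.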

If we define

w_m = Σ_n c_{m,n} (h ∗ |x_n|) − |x_m|,    (9)

then equation (3) can be rewritten as

min_{x_m} (1/2) ||s − Σ_m d_m ∗ x_m||₂² + γ Σ_m ||w_m ⊙ x_m||₁,    (10)

which immediately shows that the regularization is a weighted l1 norm on x_m. Similarly, equation (4) can be written as

min_{x_m} (1/2) ||s − Σ_m d_m ∗ x_m||₂² + λ Σ_m ||x_m||₁ + γ Σ_m ||w_m ⊙ x_m||₁.    (11)

Combining the two regularization terms from equations (10) and (11) yields a single weighted l1 penalty,

Σ_m ||(λ1 + γ w_m) ⊙ x_m||₁,    (12)

where 1 is a row vector comprised of all ones.

The resulting ADMM algorithm is very similar to the efficient ADMM algorithm for the CBPDN (convolutional BPDN) problem [32], except for the use of a weighted l1 norm, which requires a minor modification of the soft-thresholding step [33], and the need to recompute the weight vector at every iteration, as described above.
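The weighted soft-thresholding step is a one-line change from the standard operator: each coefficient gets its own threshold instead of a shared scalar. The arrays below are illustrative.

```python
import numpy as np

def weighted_soft_threshold(z, w):
    """Element-wise soft-thresholding with a per-coefficient threshold w_i,
    the minor modification that the weighted l1 norm requires."""
    return np.sign(z) * np.maximum(np.abs(z) - w, 0.0)

z = np.array([0.9, -0.9, 0.3])
w = np.array([0.5, 1.0, 0.1])  # per-atom weights, recomputed each iteration
print(weighted_soft_threshold(z, w))  # the second entry is suppressed to zero
```

A heavily inhibited atom (large w_i) is pushed to exactly zero even when its coefficient is sizeable, while lightly weighted atoms are only mildly shrunk.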

5. Comparison

5.1 Figures

Figure 1: Activations of the atoms for pitch D5 for the opening of Bach’s Minuet in G. D5 should be activated twice, at t = 0 and t = 1.6 s. The numbers on the vertical axis indicate the length of each template. (a) l1 regularization only. (b) Lateral inhibition regularization only. (c) Combined lateral inhibition and l1 regularizations.

5.2 Tables

Table 1: Average results on the first 10 s of the 30 pieces in the ENSTDkCl dataset of MAPS (higher values are better)

6. Experiments

The authors applied the different models described in the Proposed Method section to the first 10 s of the 30 pieces in the ENSTDkCl dataset of MIDI Aligned Piano Sounds (MAPS) [12], using λ = 0.05 and γ = 0.5. These values were empirically tuned on a single piece and then fixed for the entire dataset. They calculated precision, recall, and F-measure with both onset-only and onset-offset criteria [23] for each piece, with the standard MIREX parameters: an onset tolerance of 50 ms, and an offset tolerance of 20% of the correct note length or 50 ms, whichever is longer. They also calculated the average overlap ratio (AOR), which measures how much a correctly returned note overlaps with the corresponding ground-truth note. They compared the proposed method with several baseline methods: CDW-15 (with note lengths fixed at 100 ms), BW-13 (a state-of-the-art frame-based method based on PLCA proposed by Benetos and Weyde) [34], SONIC (a piano music transcription system based on neural networks), DT-14 (a generic music transcription system based on maximum likelihood by Duan and Temperley) [14], and VBB-10 (an NMF-based transcription system by Vincent et al.) [35]. BW-13 was also trained in the same context as the proposed method on the isolated notes in the ENSTDkCl dataset of MAPS. VBB-10, SONIC, and DT-14 cannot be trained in the same context, so the comparison is biased against these methods.
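A minimal sketch of the onset-only evaluation (pitch match plus 50-ms onset tolerance) and AOR. The onset-offset criterion additionally checks offsets and is omitted here; the greedy first-match pairing is an assumption for brevity, not necessarily the matching used by the MIREX tools.

```python
import numpy as np

def evaluate_onsets(ref, est, onset_tol=0.05):
    """Onset-only matching: an estimated note is correct if its pitch matches
    an unmatched reference note and its onset is within onset_tol (50 ms).
    Notes are (pitch, onset, offset) triples; returns P, R, F, and AOR."""
    matched, overlaps, used = 0, [], set()
    for p_e, on_e, off_e in est:
        for i, (p_r, on_r, off_r) in enumerate(ref):
            if i in used or p_e != p_r or abs(on_e - on_r) > onset_tol:
                continue
            used.add(i)
            matched += 1
            inter = max(0.0, min(off_e, off_r) - max(on_e, on_r))
            union = max(off_e, off_r) - min(on_e, on_r)
            overlaps.append(inter / union)  # overlap ratio of the matched pair
            break
    P = matched / len(est) if est else 0.0
    R = matched / len(ref) if ref else 0.0
    F = 2 * P * R / (P + R) if P + R else 0.0
    aor = float(np.mean(overlaps)) if overlaps else 0.0
    return P, R, F, aor

ref = [(60, 0.00, 1.00), (64, 0.50, 1.00)]
est = [(60, 0.02, 0.50), (64, 0.60, 1.10)]  # first onset is fine, second is 100 ms late
print(evaluate_onsets(ref, est))
```

In the toy example only the first estimated note is accepted, so precision, recall, and F-measure are all 0.5, and the AOR reflects how much of the matched note's duration was covered.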

Table 1 shows the average results for the entire dataset. Almost all variants of the CSC-based methods, except LI (lateral inhibition only), outperform VBB-10, BW-13, and DT-14 on F-measure under the onset-only criterion; l1+LI and CDW-15 also outperform SONIC, showing the advantage of the time-domain approach over frequency-domain methods in this setting. The l1+LI method significantly outperforms both the LI and l1 methods on F-measure, which supports the analysis that both within-group and global sparsity are needed.

From CDW-15 to l1, F-measure drops significantly under the onset-only criterion but increases slightly under the onset-offset criterion. Here l1 uses nine templates per pitch while CDW-15 uses only one. Multiple templates can be activated simultaneously in l1, resulting in a higher recall but a lower precision; when the onset-offset criterion is used, the improvement in recall dominates the decrease in precision. Precision drops significantly from CDW-15 to LI, while recall increases slightly under the onset-only criterion and significantly under the onset-offset criterion. However, the drop in precision is due to the false activation of wrong notes rather than the false activation of multiple templates of the correct note. LI+l1 slightly underperforms CDW-15 on F-measure but significantly outperforms it on AOR when the onset-only criterion is used; when the onset-offset criterion is used, LI+l1 falls behind SONIC on F-measure but significantly outperforms CDW-15 on both F-measure and AOR. Overall, the proposed method with both lateral inhibition and global sparsity regularization brings the CSC-based approach to the highest level of performance.

7. Discussion and Conclusion

In the proposed method, the authors use multiple templates of different lengths per pitch to achieve note length estimation. Lateral inhibition regularization is introduced to ensure that at most one template per pitch is activated within an inhibition window, and global sparsity is enforced through l1 regularization to reduce false activations of wrong notes. Experiments show that the proposed approach significantly outperforms a state-of-the-art music transcription method, trained in the same context-dependent setting, in transcription accuracy.

References

[1] E. Benetos, S. Dixon, D. Giannoulis, H. Kirchhoff, and A. Klapuri, “Automatic music transcription: Challenges and future directions,” J. Intell. Inf. Syst., vol. 41, no. 3, pp. 407–434, 2013.

[2] D. D. Lee and H. S. Seung, “Learning the parts of objects by nonnegative matrix factorization,” Nature, vol. 401, no. 6755, pp. 788–791, 1999.

[3] D. D. Lee and H. S. Seung, “Algorithms for non-negative matrix factorization,” in Proc. Adv. Neural Inf. Process. Syst., 2001, pp. 556–562.

[4] P. Smaragdis, B. Raj, and M. Shashanka, “A probabilistic latent variable model for acoustic modeling,” in Proc. Workshop Adv. Models Acoust. Process., 2006.

[5] E. Benetos and S. Dixon, “A shift-invariant latent variable model for automatic music transcription,” Comput. Music J., vol. 36, no. 4, pp. 81–94, 2012.

[6] J. Nam, J. Ngiam, H. Lee, and M. Slaney, “A classification-based polyphonic piano transcription approach using learned feature representations,” in Proc. Int. Soc. Music Inf. Retrieval, 2011, pp. 175–180.

[7] S. Böck and M. Schedl, “Polyphonic piano note transcription with recurrent neural networks,” in Proc. IEEE Int. Conf. Audio, Speech, Signal Process., 2012, pp. 121–124.

[8] N. Boulanger-Lewandowski, Y. Bengio, and P. Vincent, “Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription,” in Proc. 29th Int. Conf. Mach. Learn., Edinburgh, Scotland, U.K., 2012, pp. 1159–1166.

[9] S. Sigtia, E. Benetos, and S. Dixon, “An end-to-end neural network for polyphonic piano music transcription,” IEEE/ACM Trans. Audio, Speech, Language Process., vol. 24, no. 5, pp. 927–939, May 2016.

[10] M. Goto, “A real-time music-scene-description system: Predominant-f0 estimation for detecting melody and bass lines in real-world audio signals,” Speech Commun., vol. 43, no. 4, pp. 311–329, 2004.

[11] Z. Duan, B. Pardo, and C. Zhang, “Multiple fundamental frequency estimation by modeling spectral peaks and non-peak regions,” IEEE Trans. Audio, Speech, Language Process., vol. 18, no. 8, pp. 2121–2133, Nov. 2010.

[12] V. Emiya, R. Badeau, and B. David, “Multipitch estimation of piano sounds using a new probabilistic spectral smoothness principle,” IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 6, pp. 1643–1654, Aug. 2010.

[13] P. H. Peeling and S. J. Godsill, “Multiple pitch estimation using nonhomogeneous Poisson processes,” IEEE J. Sel. Topics Signal Process., vol. 5, no. 6, pp. 1133–1143, Oct. 2011.

[14] Z. Duan and D. Temperley, “Note-level music transcription by maximum likelihood sampling,” in Proc. Int. Soc. Music Inf. Retrieval, 2014, pp. 181–186.

[15] A. Cogliati and Z. Duan, “Piano music transcription modeling note temporal evolution,” in Proc. IEEE Int. Conf. Acoust. Speech, Signal Process., Brisbane, Qld, Australia, 2015, pp. 429–433.

[16] M. Campbell and C. Greated, The Musician’s Guide to Acoustics. London, U.K.: Oxford Univ. Press, 1994.

[17] T. Cheng, S. Dixon, and M. Mauch, “Modelling the decay of piano sounds,” in Proc. IEEE Int. Conf. Acoust. Speech, Signal Process., Brisbane, Qld, Australia, 2015, pp. 594–598.

[18] G. C. Grindlay and D. P. W. Ellis, “Transcribing multi-instrument polyphonic music with hierarchical eigeninstruments,” IEEE J. Sel. Topics Signal Process., vol. 5, no. 6, pp. 1159–1169, Oct. 2011.

[19] G. J. Mysore, P. Smaragdis, and B. Raj, “Non-negative hidden Markov modeling of audio with application to source separation,” in Proc. Latent Variable Anal. Signal Separation, 2010, pp. 140–148.

[20] S. Ewert, M. D. Plumbley, and M. Sandler, “A dynamic programming variant of non-negative matrix deconvolution for the transcription of struck string instruments,” in Proc. IEEE Int. Conf. Acoust. Speech, Signal Process., Brisbane, Qld, Australia, 2015, pp. 569–573.

[21] T. Cheng, M. Mauch, E. Benetos, and S. Dixon, “An attack/decay model for piano transcription,” in Proc. Int. Soc. Music Inf. Retrieval, 2016, pp. 584–590.

[22] P. Smaragdis, “Non-negative matrix factor deconvolution; extraction of multiple sound sources from monophonic inputs,” in Proc. Ind. Compon. Anal. Blind Signal Separation, 2004, pp. 494–499.

[23] M. Bay, A. F. Ehmann, and J. S. Downie, “Evaluation of multiple-f0 estimation and tracking systems,” in Proc. Int. Soc. Music Inf. Retrieval, 2009, pp. 315–320.

[24] MIREX 2016 results, 2016. [Online]. Available: http://www.music-ir.org/mirex/wiki/2016:Multiple_Fundamental_Frequency_Estimation_%26_Tracking_Results_-_MIREX_Dataset

[25] A. Cogliati, Z. Duan, and B. Wohlberg, “Piano music transcription with fast convolutional sparse coding,” in Proc. IEEE 25th Int. Workshop Mach. Learn. Signal Process., 2015, pp. 1–6.

[26] A. Cogliati, Z. Duan, and B. Wohlberg, “Context-dependent piano music transcription with convolutional sparse coding,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 24, no. 12, pp. 2218–2230, Dec. 2016.

[27] A. Szlam, K. Gregor, and Y. LeCun, “Structured sparse coding via lateral inhibition,” in Proc. 24th Int. Conf. Adv. Neural Inf. Process. Syst., 2011, pp. 1116–1124.

[28] P. Sprechmann, I. Ramírez, G. Sapiro, and Y. C. Eldar, “C-HiLasso: A collaborative hierarchical sparse modeling framework,” IEEE Trans. Signal Process., vol. 59, no. 9, pp. 4183–4198, Sep. 2011.

[29] K. O’Hanlon, H. Nagano, and M. D. Plumbley, “Structured sparsity for automatic music transcription,” in Proc. IEEE Int. Conf. Acoust. Speech, Signal Process., 2012, pp. 441–444.

[30] K. O’Hanlon and M. D. Plumbley, “Polyphonic piano transcription using non-negative matrix factorisation with group sparsity,” in Proc. IEEE Int. Conf. Acoust. Speech, Signal Process., 2014, pp. 3112–3116.

[31] B. Wohlberg, “Efficient convolutional sparse coding,” in Proc. IEEE Int. Conf. Acoust. Speech, Signal Process., Florence, Italy, 2014, pp. 7173–7177.

[32] B. Wohlberg, “Efficient algorithms for convolutional sparse representations,” IEEE Trans. Image Process., vol. 25, no. 1, pp. 301–315, Jan. 2016.

[33] B. Wohlberg, “Convolutional sparse representations as an image model for impulse noise restoration,” in Proc. IEEE Image, Video, Multidimensional Signal Process. Workshop, Bordeaux, France, Jul. 2016, pp. 1–5.

[34] E. Benetos and T. Weyde, “An efficient temporally-constrained probabilistic model for multiple-instrument music transcription,” in Proc. Int. Soc. Music Inf. Retrieval, 2015, pp. 701–707.

[35] E. Vincent, N. Bertin, and R. Badeau, “Adaptive harmonic spectral decomposition for multiple pitch estimation,” IEEE Trans. Audio, Speech Lang. Process., vol. 18, no. 3, pp. 528–537, Mar. 2010. [Online]. Available: https://hal.inria.fr/inria-00544094