Polyphonic Piano Transcription With a Note Based Music Language Model

Polyphonic Piano Transcription

Automatic music transcription (AMT) converts raw performance audio into a symbolic music representation, a core task for computational musicology. Polyphonic piano transcription is among the most challenging AMT tasks because several notes sound simultaneously and their dynamics are interdependent.

This paper explores polyphonic piano transcription with a note-based music language model (MLM), which aims to improve AMT results by modeling note transitions directly, independently of any acoustic model. To achieve this, the model uses an LSTM-RBM structure that learns long-term dependencies in note sequences and predicts the probability that a given frame contains a particular note. Evaluated on a benchmark AMT dataset, the model shows a significant improvement in transcription results over a standard RNN-based acoustic model.
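
As a concrete illustration of this structure, the following is a minimal PyTorch sketch of an LSTM-RBM over binary piano rolls. The layer sizes, the CD-1 (single-step contrastive divergence) training rule, and all names are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of an LSTM-RBM music language model (assumed sizes/training).
import torch
import torch.nn as nn

N_NOTES = 88          # piano keys (standard range; assumed here)
N_HIDDEN_RBM = 256    # RBM hidden units (illustrative size)
N_HIDDEN_LSTM = 512   # LSTM state size (illustrative size)

class LSTMRBM(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTMCell(N_NOTES, N_HIDDEN_LSTM)
        # Shared RBM weight matrix; per-frame biases come from the LSTM state.
        self.W = nn.Parameter(torch.randn(N_NOTES, N_HIDDEN_RBM) * 0.01)
        self.to_bv = nn.Linear(N_HIDDEN_LSTM, N_NOTES)       # visible bias
        self.to_bh = nn.Linear(N_HIDDEN_LSTM, N_HIDDEN_RBM)  # hidden bias

    def free_energy(self, v, bv, bh):
        # Standard RBM free energy: F(v) = -v.bv - sum softplus(vW + bh).
        return -(v * bv).sum(1) - nn.functional.softplus(v @ self.W + bh).sum(1)

    def forward(self, rolls):
        # rolls: (batch, time, 88) binary piano roll.
        b, t, _ = rolls.shape
        h = torch.zeros(b, N_HIDDEN_LSTM)
        c = torch.zeros(b, N_HIDDEN_LSTM)
        loss = 0.0
        for step in range(t - 1):
            # The LSTM summarizes the note history and emits the RBM biases.
            h, c = self.lstm(rolls[:, step], (h, c))
            bv, bh = self.to_bv(h), self.to_bh(h)
            v0 = rolls[:, step + 1]
            # One Gibbs step (CD-1) to draw a negative sample (assumed rule).
            hs = torch.bernoulli(torch.sigmoid(v0 @ self.W + bh))
            v1 = torch.bernoulli(torch.sigmoid(hs @ self.W.t() + bv)).detach()
            # Contrastive divergence: lower the data's free energy,
            # raise the sample's.
            loss = loss + (self.free_energy(v0, bv, bh)
                           - self.free_energy(v1, bv, bh)).mean()
        return loss / (t - 1)
```

In this arrangement the RBM models which notes co-occur within a single frame, while the LSTM carries the long-term temporal context across frames.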


AMT is a broad field of research, and each instrument has its own timbral characteristics, which call for instrument-specific constraints in an AMT system. Instrument-specific transcription research models the timbre of the instrument in question, an approach that has proven effective for improving AMT performance. However, the limited number of parameters in instrument-specific models restricts their generalization to unseen combinations of instruments.


Most AMT methods use a recurrent neural network to predict note events frame by frame, and then decode the resulting frame probabilities into note states with a simple rule (e.g., a note is considered active in a frame whenever its probability exceeds a fixed threshold such as 0.5). This approach has proven effective for producing accurate frame-level predictions. However, several limitations and open questions remain in the current state of AMT.
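
The simple decoding rule above fits in a few lines. The following NumPy sketch binarizes frame-wise probabilities with a fixed threshold and groups consecutive active frames into note events; the threshold, frame rate, and MIDI pitch offset are illustrative assumptions.

```python
# Sketch: threshold-based decoding of frame probabilities into notes.
import numpy as np

def decode_frames(probs, threshold=0.5, frame_rate=100.0):
    """probs: (time, 88) frame-wise note probabilities (assumed layout)."""
    active = probs >= threshold             # binarize the piano roll
    notes = []                              # (midi_pitch, onset_s, offset_s)
    for pitch in range(active.shape[1]):
        onset = None
        for t in range(active.shape[0]):
            if active[t, pitch] and onset is None:
                onset = t                   # note turns on
            elif not active[t, pitch] and onset is not None:
                notes.append((pitch + 21,   # MIDI pitch, assuming A0 = 21
                              onset / frame_rate, t / frame_rate))
                onset = None
        if onset is not None:               # note still sounding at the end
            notes.append((pitch + 21, onset / frame_rate,
                          active.shape[0] / frame_rate))
    return notes
```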

One limitation of existing frame-based AMT approaches is that they may miss note onsets, because frame-level activations alone do not indicate where a note begins. To overcome this, some frame-based methods split the prediction process into two stages: an onset detection stage followed by an additional pitch estimation stage. Other solutions integrate pitch and onset estimation into a single framework [14]. Kameoka used harmonic temporal structured clustering to estimate note attributes jointly. Cogliati and Duan modeled the temporal evolution of piano notes through convolutional sparse coding, while Ewert employed spectro-temporal patterns in supervised NMF.
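
To illustrate the two-stage strategy, here is a hedged NumPy sketch in which an onset detector gates where notes may begin and frame-wise activations determine how long they sustain. Both inputs, the thresholds, and the function names are assumptions for illustration, not any particular paper's decoder.

```python
# Sketch: two-stage decoding, onsets gate note starts, frames give duration.
import numpy as np

def decode_with_onsets(frame_probs, onset_probs,
                       frame_thresh=0.5, onset_thresh=0.5):
    """Both inputs: (time, 88) probability arrays (assumed layout)."""
    frames = frame_probs >= frame_thresh
    onsets = onset_probs >= onset_thresh
    notes = []                              # (midi_pitch, start_frame, end_frame)
    for pitch in range(frames.shape[1]):
        t = 0
        while t < frames.shape[0]:
            if onsets[t, pitch]:            # a note may only start at an onset
                start = t
                t += 1
                while t < frames.shape[0] and frames[t, pitch]:
                    t += 1                  # sustain while the frame is active
                notes.append((pitch + 21, start, t))
            else:
                t += 1
    return notes
```

Gating note starts on the onset detector suppresses the spurious short activations that a purely frame-based decoder tends to produce.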

The note-based MLM combines a recurrent neural network with a graph convolution network to model the temporal correlations between notes, and it is trainable end to end through joint feature and label learning. Experiments on public piano datasets show that the model recovers more concurrent notes than previous AMT methods and yields superior frame-level and note-level transcription results. The model is further validated by predicting notes at onsets that thresholding-based transcription leaves blank; the resulting predictions correspond closely with the ground-truth piano rolls at the affected frames.
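
One way such a validation could work is to fuse the language-model prior with the acoustic frame probabilities and re-threshold, so that notes the acoustic model alone leaves blank can still be recovered. The geometric-mean fusion below is an assumed illustration of this idea, not the paper's method; the fusion weight and function interface are hypothetical.

```python
# Sketch: fusing acoustic and language-model probabilities (assumed scheme).
import numpy as np

def refine_with_mlm(acoustic_probs, mlm_probs, weight=0.5, threshold=0.5):
    """Both inputs: (time, 88) probability arrays; returns a binary roll."""
    eps = 1e-8                              # guard against log(0)
    # Weighted geometric mean of the two probability estimates.
    fused = np.exp((1 - weight) * np.log(acoustic_probs + eps)
                   + weight * np.log(mlm_probs + eps))
    return fused >= threshold               # re-threshold the fused roll
```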
