
Student: 謝宛庭 (Hsieh, Wan-Ting)
Thesis Title: 利用多特徵訓練對吉他演奏進行自動採譜 (Automatic Music Transcription of Guitar Performance Using Multitask Learning)
Advisors: 陳柏琳 (Chen, Berlin); 蘇黎 (Su, Li)
Oral Defense Committee: 王家慶 (Wang, Jia-Ching); 黃郁芬 (Huang, Yu-Feng); 陳柏琳 (Chen, Berlin); 蘇黎 (Su, Li)
Oral Defense Date: 2022/09/07
Degree: Master
Department: 資訊工程學系 (Department of Computer Science and Information Engineering)
Year of Publication: 2022
Academic Year of Graduation: 110
Language: English
Number of Pages: 60
Keywords (Chinese): 自動音樂採譜、吉他轉譜、深度學習、多音預測、多任務學習
Keywords (English): Automatic Music Transcription, Guitar Transcription, Deep Learning, Multi-Pitch Estimation, Multitask Learning
DOI URL: http://doi.org/10.6345/NTNU202201836
Thesis Type: Academic thesis
    Automatic Music Transcription (AMT) is defined as converting acoustic music signals into musical notation. Most previous research has targeted solo piano or multi-instrument performances, while fewer AMT systems have done similar work on music played on the guitar. Guitar songs are typically played on six strings with different fingerings, strumming, chord progressions, and other techniques, and may involve both single-note and chordal playing. The model must identify the played notes from the rich harmonics produced by the six different strings in a guitar performance. Within a song, single notes are very likely to be chord tones, and most notes tend to fall on the beat or at beat-related positions (such as the second half of the beat). Therefore, in this research we perform automatic transcription of guitar recordings, using not only notes as output labels but also taking the side information of chords and beats into account.

    In previous AMT subtasks, note-level transcription usually used only notes as output labels. We conducted several multitask learning experiments that output note, chord, and beat labels simultaneously, hoping to improve note transcription performance on guitar music, and we also recorded the effectiveness of chord recognition and beat tracking in this system.

    Automatic Music Transcription (AMT) is the conversion of acoustic music signals into musical notation. Most previous AMT studies focused on solo-piano or multi-instrument transcription, while few AMT systems have addressed guitar audio. A guitar piece is played on six strings with varied fingerings, strumming, chord progressions, and other techniques, and a passage may be played as single notes or as chords, which sound quite different. An AMT model must identify the played notes from the rich harmonics produced by six distinct strings. Furthermore, a single note has a high probability of being a chord tone, and most notes tend to appear on the beat or at beat-related positions (e.g., the second half of the beat). To deal with this musical complexity, in this research we not only use note information as the output labels of a guitar AMT system but also take chord and beat information into consideration.

    In previous work on AMT subtasks, note-level transcription typically used only note information as output labels. We performed several multitask learning experiments that output note, chord, and beat labels simultaneously, aiming to improve note transcription performance on guitar audio. We also report the results of chord recognition and beat tracking within this system.
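
    As a rough illustration of the multitask setup described above (one shared encoder whose features feed separate note, chord, and beat output heads, with their losses summed), the following Python/PyTorch sketch shows the general shape of such a model. It is only a minimal sketch under assumed dimensions: the encoder, head sizes, chord vocabulary, and loss weights here are illustrative assumptions, not the architecture actually used in the thesis (which builds on DeepLabV3+ / attention-based U-net components and focal loss).

        # Minimal multitask sketch (illustrative only; dimensions and weights are assumptions).
        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class MultitaskGuitarAMT(nn.Module):
            def __init__(self, n_bins=352, n_pitches=88, n_chords=25):
                super().__init__()
                # Shared encoder over a time-frequency input of shape (batch, 1, time, n_bins).
                self.encoder = nn.Sequential(
                    nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
                    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
                )
                # Task-specific heads share the encoder's frame-wise features.
                self.note_head = nn.Linear(64 * n_bins, n_pitches)   # multi-pitch activations
                self.chord_head = nn.Linear(64 * n_bins, n_chords)   # chord class per frame
                self.beat_head = nn.Linear(64 * n_bins, 1)           # beat activation per frame

            def forward(self, spec):
                h = self.encoder(spec)                # (batch, 64, time, n_bins)
                h = h.permute(0, 2, 1, 3).flatten(2)  # (batch, time, 64 * n_bins)
                return self.note_head(h), self.chord_head(h), self.beat_head(h)

        def multitask_loss(note_logits, chord_logits, beat_logits,
                           note_tgt, chord_tgt, beat_tgt, weights=(1.0, 0.5, 0.5)):
            # Notes and beats are frame-wise binary targets; chords are one class per frame.
            note_loss = F.binary_cross_entropy_with_logits(note_logits, note_tgt)
            chord_loss = F.cross_entropy(chord_logits.transpose(1, 2), chord_tgt)
            beat_loss = F.binary_cross_entropy_with_logits(beat_logits.squeeze(-1), beat_tgt)
            return weights[0] * note_loss + weights[1] * chord_loss + weights[2] * beat_loss

    Training on the joint loss lets gradients from the chord and beat heads shape the shared representation, which is the general mechanism by which such side information can help note transcription.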

    Chapter 1 Introduction
        1.1 Background and Motivation
        1.2 Data Representations
        1.3 Proposed Model
        1.4 Achievement
        1.5 Agenda of the Thesis
    Chapter 2 Related Work
        2.1 Background of AMT
            2.1.1 Four Levels of AMT
        2.2 Datasets
            2.2.1 Introduction to GuitarSet
            2.2.2 GuitarSet Visualization
        2.3 Model
            2.3.1 DeepLabV3+
            2.3.2 Attention-based bottleneck U-net architecture
    Chapter 3 Methodology
        3.1 Data Representations
        3.2 One-hot encoding
        3.3 Label Extension
        3.4 Model
        3.5 Focal Loss
        3.6 Post-processing
    Chapter 4 Experimental Setup
        4.1 Settings
        4.2 Guitar dataset
        4.3 Training
        4.4 Evaluation Method
    Chapter 5 Experimental Results
        5.1 Baseline model and Model 1
        5.2 Model 2 and Model 3
        5.3 Model 3.1 and Model 4
            5.3.1 Evaluating the prediction of Model 3.1 in diverse genres
    Chapter 6 Conclusion and Future Work
        6.1 Conclusion and Future work
    Bibliography

