Author: | 陳俊彥 Chen, Chun-Yen |
---|---|
Thesis title: | 整合全局場景與局部注意的自監督多標籤分類 From Whole to Parts: Integrating Global Context and Local Attention for Self-Supervised Multi-Label Classification |
Advisor: | 葉梅珍 Yeh, Mei-Chen |
Oral defense committee: | 王鈺強 Wang, Yu-Chiang; 康立威 Kang, Li-Wei; 葉梅珍 Yeh, Mei-Chen |
Oral defense date: | 2023/07/24 |
Degree: | Master |
Department: | 資訊工程學系 Department of Computer Science and Information Engineering |
Publication year: | 2023 |
Academic year of graduation: | 111 |
Language: | English |
Number of pages: | 43 |
Keywords (Chinese): | 自監督學習、對比學習、多標籤分類 |
Keywords (English): | Self-supervised learning, Contrastive learning, Multi-label classification |
DOI URL: | http://doi.org/10.6345/NTNU202301210 |
Thesis type: | Academic thesis |
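Self-supervised learning has achieved remarkable results across a variety of computer vision tasks, demonstrating its effectiveness in a wide range of applications. Despite these successes, research addressing the challenges of multi-label classification remains relatively limited. The area has yet to be explored in depth, and further study is needed to make full use of self-supervised learning techniques for multi-label classification.
In this thesis, we propose a multi-level representation learning framework (GOLANG) for self-supervised multi-label classification that captures both the scene and the object information of an image. Our method combines global context and local alignment to capture semantic information at different levels of an image. The global module of the framework learns from the whole image by average-pooling the output features, while the local alignment module removes object-irrelevant nuisances by learning where to attend.
By integrating the two modules, our model effectively learns semantic information at multiple levels from images. To further improve the model's ability to extract object-scene relationships, we introduce a global and local swapped prediction technique that captures the complex relationships between the objects and scenes in an image. In experiments on self-supervised multi-label classification, the GOLANG framework shows strong performance, highlighting its effectiveness in capturing the intricate relationships among multiple objects and scenes in multi-label images.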
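The contrast between the global module (average pooling over the output features) and the local module (attention over locations) can be illustrated with a small sketch. This is not the thesis's implementation: the function names, the toy feature map, and the attention logits are all hypothetical, and a softmax over per-location logits is assumed here as one simple way to express "learning where to attend".

```python
import math

def global_average_pool(features):
    """Average per-location feature vectors into one global vector."""
    n = len(features)
    dim = len(features[0])
    return [sum(f[d] for f in features) / n for d in range(dim)]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(features, attention_logits):
    """Weight each location by a softmax over its logit, then sum.

    Assumption: the logits would come from a learned attention head;
    here they are supplied by hand for illustration.
    """
    weights = softmax(attention_logits)
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]

# Toy 2x2 feature map flattened to 4 locations with 2 channels.
# The first two locations stand for background, the last two for an object.
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

# Hypothetical logits that concentrate attention on the object locations.
logits = [0.0, 0.0, 5.0, 5.0]

global_vec = global_average_pool(features)   # mixes background and object
local_vec = attention_pool(features, logits)  # dominated by the object channel
uniform_vec = attention_pool(features, [0.0, 0.0, 0.0, 0.0])
```

With uniform logits the attention pooling reduces to plain average pooling, so in this sketch the local branch differs from the global one only where the attention is non-uniform.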
Self-supervised learning has shown promising results in various computer vision tasks, proving its effectiveness in a wide range of applications. However, despite these successes, there has been limited work specifically addressing the challenges of multi-label classification. This area remains relatively underexplored, and further research is needed to fully harness the potential of self-supervised learning techniques for multi-label classification tasks.
In this thesis, we present GOLANG, a multi-level representation learning framework for self-supervised multi-label classification that captures image context and object information simultaneously. Our approach combines global context learning and local alignment to capture different levels of semantic information in images. The global context learning module learns from the whole image, while the local alignment module eliminates object-irrelevant nuisances by learning where to learn.
By integrating both modules, our model learns diverse levels of semantic information to facilitate the multi-label classification task. To further enhance the model's ability to extract object-scene relationships, we introduce cross-level prediction, which captures the interplay between the objects and scenes within an image. The GOLANG framework demonstrates state-of-the-art performance on self-supervised multi-label classification tasks, highlighting its effectiveness in modeling the relationships among multiple objects and scenes in images.
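The abstract does not spell out how cross-level prediction is computed. As a rough illustration only, the sketch below assumes a swapped-prediction objective in which the global embedding's soft assignment over a set of prototypes serves as the target for the local embedding's prediction, and vice versa. The function names, the temperatures, and the use of a sharpened softmax as the target (in place of any assignment procedure the thesis may use) are all assumptions.

```python
import math

def softmax(scores, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, predicted):
    """Cross-entropy between a target distribution and a prediction."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted))

def swapped_prediction_loss(global_scores, local_scores,
                            target_temp=0.05, pred_temp=0.1):
    """Each level predicts the other level's prototype assignment.

    Assumption: scores are similarities to prototypes in [-1, 1],
    so the softmax outputs stay strictly positive.
    """
    q_global = softmax(global_scores, target_temp)  # target from global view
    q_local = softmax(local_scores, target_temp)    # target from local view
    p_global = softmax(global_scores, pred_temp)    # prediction from global view
    p_local = softmax(local_scores, pred_temp)      # prediction from local view
    return 0.5 * (cross_entropy(q_global, p_local)
                  + cross_entropy(q_local, p_global))

# Hypothetical similarity scores against three prototypes.
agreeing = swapped_prediction_loss([0.9, 0.1, 0.0], [0.8, 0.2, 0.0])
disagreeing = swapped_prediction_loss([0.9, 0.1, 0.0], [0.0, 0.1, 0.9])
```

In this sketch the loss is small when the global and local branches assign the image to the same prototype and large when they disagree, which is the behavior a cross-level consistency objective would be expected to encourage.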