
Author: Lee, Yi-Nan (李奕男)
Title: Deep Visual Semantic Transform Model Learning from Multi-Label Images (從多標籤圖像學習之深層視覺語意轉換模型)
Advisor: Yeh, Mei-Chen (葉梅珍)
Degree: Master
Department: Department of Computer Science and Information Engineering
Year of Publication: 2017
Graduating Academic Year: 105 (2016–2017)
Language: Chinese
Pages: 42
Keywords: Convolutional neural network, visual semantic embedding model, multi-label image, multi-label classification problem
DOI URL: https://doi.org/10.6345/NTNU202202567
Thesis Type: Academic thesis
Abstract: Learning the relation between images and text semantics has long been an important problem in machine learning and computer vision. This thesis studies the association between images and words. Words carry semantic relations to one another: "sky" and "cloud", for example, are semantically close, while "sky" and "car" are almost unrelated. We argue, however, that the strength of the semantic relation between two words can change depending on the image at hand. In an image that contains both the sky and a car, for instance, the words "sky" and "car", though nearly unrelated on their own, become connected through the image. We therefore propose a Convolutional Neural Network based model that links an image to the semantics of its multiple text labels. Its input is an image, and, unlike existing visual semantic embedding models, its output is a linear transformation function: each input image is mapped to a function that measures the relevance of each word to that image and is then used to predict the image's likely labels. The model is validated on the NUS-WIDE dataset, and the experimental results show that it performs well at predicting labels for images.
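The full text is not available for download, so the exact architecture cannot be reproduced here. The following PyTorch sketch only illustrates the idea stated in the abstract, assuming a CNN that regresses the parameters of a per-image linear function, which is then applied to pretrained word embeddings (e.g., word2vec vectors [4]) to score label relevance. All names and dimensions (VisualSemanticTransform, feat_dim, embed_dim, the toy backbone) are hypothetical stand-ins, not the author's implementation.

    import torch
    import torch.nn as nn

    class VisualSemanticTransform(nn.Module):
        """Map an image to a linear scoring function over word embeddings."""
        def __init__(self, embed_dim=300, feat_dim=4096):
            super().__init__()
            # Toy stand-in for a pretrained CNN feature extractor; the thesis
            # builds on a convolutional network whose details are not public.
            self.backbone = nn.Sequential(
                nn.Flatten(),
                nn.Linear(3 * 224 * 224, feat_dim),
                nn.ReLU(),
            )
            # Regress the parameters of the per-image linear transform.
            self.to_transform = nn.Linear(feat_dim, embed_dim)

        def forward(self, images, word_embeddings):
            # images: (batch, 3, 224, 224); word_embeddings: (num_labels, embed_dim)
            t = self.to_transform(self.backbone(images))  # (batch, embed_dim)
            # Each image's transform scores every candidate label's word vector.
            return t @ word_embeddings.t()                # (batch, num_labels)

    # Example: score 81 candidate tags (cf. the tags81 experiment) for two images.
    model = VisualSemanticTransform()
    scores = model(torch.randn(2, 3, 224, 224), torch.randn(81, 300))
    print(scores.shape)  # torch.Size([2, 81])

Under this reading, the per-image transform is a single vector (a rank-1 function over embeddings); the thesis may instead regress a full matrix, in which case to_transform would output embed_dim × k parameters.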

Table of Contents:
List of Tables
List of Figures
Chapter 1  Introduction
    1.1 Research Background and Motivation
    1.2 Research Objectives
    1.3 Thesis Organization
Chapter 2  Related Work
    2.1 Image Models
    2.2 Text Semantic Embedding Models
    2.3 Models Combining Images and Semantics
Chapter 3  Method
    3.1 Model Design Rationale
    3.2 Model Architecture
    3.3 Loss Function
    3.4 Model Summary
Chapter 4  Experimental Results and Analysis
    4.1 Model Setup
    4.2 Dataset
    4.3 Experiment 1 (tags81)
    4.4 Experiment 2 (tags1k)
    4.5 Experiment 3 (Parameters of the Transformation Function)
    4.6 Experiment 4 (Discussion of the Model and Results)
    4.7 Image Examples
Chapter 5  Conclusion and Future Work
References

References:
[1] A. Frome, G. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov. DeViSE: A deep visual-semantic embedding model. In NIPS, 2013.
[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[3] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[4] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. In ICLR, 2013.
[5] J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. In EMNLP, 2014.
[6] J. Wang, Y. Yang, J. Mao, Z. Huang, C. Huang, and W. Xu. CNN-RNN: A unified framework for multi-label image classification. In CVPR, 2016.
[7] Z. Ren, H. Jin, Z. Lin, C. Fang, and A. Yuille. Multi-instance visual-semantic embedding. arXiv preprint arXiv:1512.06963, 2015.
[8] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[9] Y. LeCun, S. Chopra, R. Hadsell, M. Ranzato, and F. Huang. A tutorial on energy-based learning. In Predicting Structured Data, MIT Press, 2006.
[10] T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. Zheng. NUS-WIDE: A real-world web image database from National University of Singapore. In Proceedings of the ACM International Conference on Image and Video Retrieval, 2009.
[11] Y. Gong, Y. Jia, T. Leung, A. Toshev, and S. Ioffe. Deep convolutional ranking for multilabel image annotation. arXiv preprint arXiv:1312.4894, 2013.
[12] Z. Lin, G. Ding, M. Hu, Y. Lin, and S. S. Ge. Image tag completion via dual-view linear sparse reconstructions. Computer Vision and Image Understanding, 124:42–60, 2014.
[13] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. CoRR, abs/1409.0575, 2014.

Full Text: Not authorized for public access; download unavailable.