Simple Search / Detailed Record

Graduate Student: Chen, Wei-Jyun (陳維均)
Thesis Title: A Two-head Prediction Network for Fine-Grained Action Recognition (用於精細動作辨識的雙頭預測網路)
Advisor: Lin, Cheng-Hung (林政宏)
Oral Defense Committee: Lai, Ying-Hui (賴穎暉); Chen, Yung-Chih (陳勇志); Lin, Cheng-Hung (林政宏)
Oral Defense Date: 2021/08/25
Degree: Master
Department: Department of Electrical Engineering
Publication Year: 2021
Graduation Academic Year: 109 (2020-2021)
Language: Chinese
Pages: 36
Chinese Keywords: 深度學習 (deep learning), 影像辨識 (image recognition), 動作辨識 (action recognition)
英文關鍵詞: deep learning, image recognition, action recognition
DOI URL: http://doi.org/10.6345/NTNU202101249
Thesis Type: Academic thesis
Usage: 137 views; 18 downloads
Abstract: In recent years, deep learning has advanced rapidly; beyond 2D image recognition, 3D action recognition is now also attracting attention. Research on action recognition began with 3D CNNs, which achieved good results on many datasets. Most action recognition networks, however, still have room for improvement on fine-grained actions: a fine-grained action differs little from an ordinary action overall, and the difference may occur only within a short interval, which makes it very hard for current 3D models to distinguish. This situation is common in basketball games, where body contact happens constantly but does not necessarily constitute a foul; recognizing these fouls therefore requires stronger detection of fine-grained actions. Because no suitable dataset existed for this line of research, we collected the data ourselves and built a basketball-foul dataset. In this thesis, we propose a two-head prediction network that can be attached to existing networks, including 3D-ResNet50 [1], (2+1)D-ResNet50 [2], and I3D-50 [3], to improve fine-grained action recognition. Experimental results show that adding the proposed network improves accuracy by 3-7% across these models.
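As a rough illustration of the idea described above, the sketch below shows one plausible way a two-head prediction module (a classifier head plus a score head, per the table of contents) can be attached to an off-the-shelf 3D backbone in PyTorch. This is a minimal sketch under stated assumptions, not the thesis's actual implementation: the backbone (torchvision's r3d_18, standing in for 3D-ResNet50), the head shapes, and the use of the score to reweight the class logits are all illustrative choices.

    # Minimal sketch (assumptions noted above): a shared 3D backbone feeding
    # a classifier head and a score head. Not the thesis's actual code.
    import torch
    import torch.nn as nn
    from torchvision.models.video import r3d_18  # stand-in for 3D-ResNet50

    class TwoHeadNet(nn.Module):
        def __init__(self, num_classes: int):
            super().__init__()
            backbone = r3d_18(weights=None)
            feat_dim = backbone.fc.in_features   # 512 for r3d_18
            backbone.fc = nn.Identity()          # strip the original classifier
            self.backbone = backbone
            # Classifier head: predicts the action class.
            self.classifier_head = nn.Linear(feat_dim, num_classes)
            # Score head: a scalar per clip in (0, 1); here it is assumed to
            # reweight the class logits, which is an illustrative choice.
            self.score_head = nn.Sequential(nn.Linear(feat_dim, 1), nn.Sigmoid())

        def forward(self, clip: torch.Tensor):
            # clip: (batch, channels, frames, height, width)
            feat = self.backbone(clip)           # (batch, feat_dim)
            logits = self.classifier_head(feat)  # (batch, num_classes)
            score = self.score_head(feat)        # (batch, 1)
            return logits * score, score

    # Example usage with a random 16-frame clip:
    # logits, score = TwoHeadNet(num_classes=2)(torch.randn(1, 3, 16, 112, 112))

Because both heads share one backbone, the extra cost over the base network is only two small linear layers, which is consistent with the abstract's framing of the module as an add-on to existing networks.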

Table of Contents:
Acknowledgements
Abstract (Chinese)
Abstract (English)
Table of Contents
List of Figures
Chapter 1: Introduction
  1.1 Research Background and Motivation
  1.2 Research Objectives
  1.3 Overview of the Research Method
  1.4 Research Contributions
  1.5 Thesis Organization
Chapter 2: Literature Review
  2.1 Development of Action Recognition
    2.1.1 Two-Stream
    2.1.2 3D Convolution
  2.2 Related Work
    2.2.1 Fine-grained action recognition with 2D convolution + recurrent neural networks
    2.2.2 Fine-grained action recognition with 3D convolution
Chapter 3: Research Method
  3.1 System Architecture
    3.1.1 Two-head Prediction: Score Head and Classifier Head
    3.1.2 T-fixed Backbone
  3.2 Network Training and Testing
  3.3 Data Collection
    3.3.1 Selecting Data Sources
    3.3.2 Labeling Details
    3.3.3 Dataset Statistics
Chapter 4: Experimental Results
  4.1 Experimental Setup
    4.1.1 Validation Method
    4.1.2 Training Details
  4.2 Experimental Data
Chapter 5: Conclusion and Future Work
  5.1 Conclusion
  5.2 Future Work
References
Autobiography
Academic Achievements

    References:
    [1] Hara, Kensho, Hirokatsu Kataoka, and Yutaka Satoh. “Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet?.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
    [2] Tran, Du, et al. “A closer look at spatiotemporal convolutions for action recognition.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
    [3] Carreira, Joao, and Andrew Zisserman. “Quo vadis, action recognition? A new model and the Kinetics dataset.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
    [4] Simonyan, Karen, and Andrew Zisserman. “Two-stream convolutional networks for action recognition in videos.” arXiv preprint arXiv:1406.2199 (2014).
    [5] Tran, Du, et al. “Learning spatiotemporal features with 3D convolutional networks.” Proceedings of the IEEE International Conference on Computer Vision. 2015.
    [6] Wang, Limin, et al. “Temporal segment networks: Towards good practices for deep action recognition.” European Conference on Computer Vision. Springer, Cham, 2016.
    [7] Kay, Will, et al. “The Kinetics human action video dataset.” arXiv preprint arXiv:1705.06950 (2017).
    [8] Deng, Jia, et al. “ImageNet: A large-scale hierarchical image database.” 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009.
    [9] Feichtenhofer, Christoph, et al. “SlowFast networks for video recognition.” Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019.
    [10] Chéron, Guilhem, Ivan Laptev, and Cordelia Schmid. “P-CNN: Pose-based CNN features for action recognition.” Proceedings of the IEEE International Conference on Computer Vision. 2015.
    [11] Choutas, Vasileios, et al. “PoTion: Pose motion representation for action recognition.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
    [12] Wu, Zuxuan, et al. “Harnessing object and scene semantics for large-scale video understanding.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
    [13] Ji, Shuiwang, et al. “3D convolutional neural networks for human action recognition.” IEEE Transactions on Pattern Analysis and Machine Intelligence 35.1 (2013): 221-231. doi: 10.1109/TPAMI.2012.59.
    [14] He, Kaiming, et al. “Deep residual learning for image recognition.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
    [15] Tran, Du, et al. “ConvNet architecture search for spatiotemporal feature learning.” arXiv preprint arXiv:1708.05038 (2017).
    [16] Yue-Hei Ng, Joe, et al. “Beyond short snippets: Deep networks for video classification.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.
    [17] Donahue, Jeffrey, et al. “Long-term recurrent convolutional networks for visual recognition and description.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.
    [18] Sun, Chen, et al. “Temporal localization of fine-grained actions in videos by domain transfer from web images.” Proceedings of the 23rd ACM International Conference on Multimedia. 2015.
    [19] Yuan, Jun, et al. “Temporal action localization with pyramid of score distribution features.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
    [20] Yeung, Serena, et al. “End-to-end learning of action detection from frame glimpses in videos.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
    [21] Yeung, Serena, et al. “Every moment counts: Dense detailed labeling of actions in complex videos.” International Journal of Computer Vision 126.2 (2018): 375-389.
    [22] Lea, Colin, et al. “Segmental spatiotemporal CNNs for fine-grained action segmentation.” European Conference on Computer Vision. Springer, Cham, 2016.
    [23] Singh, Bharat, et al. “A multi-stream bi-directional recurrent neural network for fine-grained action detection.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
    [24] Piergiovanni, A. J., and Michael S. Ryoo. “Fine-grained activity recognition in baseball videos.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2018.
    [25] Piergiovanni, A. J., Chenyou Fan, and Michael S. Ryoo. “Learning latent subevents in activity videos using temporal attention filters.” Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 31. No. 1. 2017.
    [26] Zhu, Chen, et al. “Fine-grained video categorization with redundancy reduction attention.” Proceedings of the European Conference on Computer Vision (ECCV). 2018.
    [27] Shou, Zheng, Dongang Wang, and Shih-Fu Chang. “Temporal action localization in untrimmed videos via multi-stage CNNs.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
    [28] Xu, Huijuan, Abir Das, and Kate Saenko. “R-C3D: Region convolutional 3D network for temporal activity detection.” Proceedings of the IEEE International Conference on Computer Vision. 2017.
    [29] Shou, Zheng, et al. “CDC: Convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
    [30] Lin, Cheng-Hung, Min-Yen Tsai, and Po-Yung Chou. “A lightweight fine-grained action recognition network for basketball foul detection.” Proceedings of the IEEE International Conference on Consumer Electronics. 2021.
    [31] Lin, Min, Qiang Chen, and Shuicheng Yan. “Network in network.” arXiv preprint arXiv:1312.4400 (2013).
    [32] Kuehne, Hildegard, et al. “HMDB: A large video database for human motion recognition.” 2011 International Conference on Computer Vision. IEEE, 2011.
    [33] Soomro, Khurram, Amir Roshan Zamir, and Mubarak Shah. “UCF101: A dataset of 101 human actions classes from videos in the wild.” arXiv preprint arXiv:1212.0402 (2012).
    [34] Caba Heilbron, Fabian, et al. “ActivityNet: A large-scale video benchmark for human activity understanding.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.
    [35] Gu, Chunhui, et al. “AVA: A video dataset of spatio-temporally localized atomic visual actions.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
    [36] Shahroudy, Amir, et al. “NTU RGB+D: A large scale dataset for 3D human activity analysis.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
    [37] Rohrbach, Marcus, et al. “A database for fine grained activity detection of cooking activities.” 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2012.
    [38] Sun, Shan, et al. “Taichi: A fine-grained action recognition dataset.” Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval. 2017.
    [39] Karpathy, Andrej, et al. “Large-scale video classification with convolutional neural networks.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014.
    [40] Varol, Gül, Ivan Laptev, and Cordelia Schmid. “Long-term temporal convolutions for action recognition.” IEEE Transactions on Pattern Analysis and Machine Intelligence 40.6 (2017): 1510-1517.
    [41] Feichtenhofer, Christoph, Axel Pinz, and Andrew Zisserman. “Convolutional two-stream network fusion for video action recognition.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
    [42] Qiu, Zhaofan, Ting Yao, and Tao Mei. “Learning spatio-temporal representation with pseudo-3D residual networks.” Proceedings of the IEEE International Conference on Computer Vision. 2017.
    [43] Hara, Kensho, Hirokatsu Kataoka, and Yutaka Satoh. “Learning spatio-temporal features with 3D residual networks for action recognition.” Proceedings of the IEEE International Conference on Computer Vision Workshops. 2017.
    [44] Fan, Lijie, et al. “End-to-end learning of motion representation for video understanding.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
    [45] Zheng, Zhenxing, et al. “Global and local knowledge-aware attention network for action recognition.” IEEE Transactions on Neural Networks and Learning Systems 32.1 (2020): 334-347.
    [46] Ji, Rong. “Research on basketball shooting action based on image feature extraction and machine learning.” IEEE Access 8 (2020): 138743-138751.
    [47] Cao, Zhe, et al. “OpenPose: Realtime multi-person 2D pose estimation using Part Affinity Fields.” IEEE Transactions on Pattern Analysis and Machine Intelligence 43.1 (2019): 172-186.
    [48] Fang, Hao-Shu, et al. “RMPE: Regional multi-person pose estimation.” Proceedings of the IEEE International Conference on Computer Vision. 2017.
    [49] Zhu, Yi, et al. “A comprehensive study of deep video action recognition.” arXiv preprint arXiv:2012.06567 (2020).
    [50] Lea, Colin, René Vidal, and Gregory D. Hager. “Learning convolutional action primitives for fine-grained action recognition.” 2016 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2016.
    [51] Munro, Jonathan, and Dima Damen. “Multi-modal domain adaptation for fine-grained action recognition.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.
    [52] Zagoruyko, Sergey, and Nikos Komodakis. “Wide residual networks.” arXiv preprint arXiv:1605.07146 (2016).
    [53] Simonyan, Karen, and Andrew Zisserman. “Very deep convolutional networks for large-scale image recognition.” arXiv preprint arXiv:1409.1556 (2014).
    [54] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. “ImageNet classification with deep convolutional neural networks.” Advances in Neural Information Processing Systems 25 (2012): 1097-1105.
    [55] Zach, Christopher, Thomas Pock, and Horst Bischof. “A duality based approach for realtime TV-L1 optical flow.” Joint Pattern Recognition Symposium. Springer, Berlin, Heidelberg, 2007.
    [56] Kingma, Diederik P., and Jimmy Ba. “Adam: A method for stochastic optimization.” arXiv preprint arXiv:1412.6980 (2014).
    [57] Wang, Heng, et al. “Action recognition by dense trajectories.” CVPR 2011. IEEE, 2011, pp. 3169-3176. doi: 10.1109/CVPR.2011.5995407.
