
Graduate Student: TSAI, Min-Yan (蔡旻諺)
Thesis Title: Integration of Slow-Fast Network and Attention Adaptive Graph Convolutional Network for Skeleton-Based Action Recognition (應用於人體骨架動作辨識的結合快慢網路與注意力自適性圖卷積架構)
Advisor: Lin, Cheng-Hung (林政宏)
Committee Members: Chen, Yung-Chih (陳勇志); Lai, Ying-Hui (賴穎暉); Lin, Cheng-Hung (林政宏)
Date of Oral Defense: 2022/01/18
Degree: Master
Department: Department of Electrical Engineering
Year of Publication: 2022
Academic Year of Graduation: 110 (ROC calendar)
Language: Chinese
Number of Pages: 41
Keywords: Action Recognition, Graph Convolutional Network, Feature Fusion
Research Method: Experimental Design
DOI URL: http://doi.org/10.6345/NTNU202200209
Thesis Type: Academic Thesis
Access Counts: Views: 110; Downloads: 29
Abstract: This thesis examines RGB-based and skeleton-based action recognition. Skeleton-based action recognition has developed rapidly in recent years, most notably through graph convolutional networks that represent the human body structure with adjacency matrices. Recent work emphasizes the ability to form long-range (cross-distance) connections within the graph convolution and learns multiple types of skeleton data to reach higher accuracy on large-scale datasets. We argue that analyzing the motion itself is just as important as learning diverse data types, so we adapt the two-stream approach from RGB-based action recognition: a single type of skeleton sequence is sampled at a high frame rate and a low frame rate to extract dynamic and static motion information, respectively. The two pathways also serve as two joint-connection strategies, one focusing on links between interval (strided) frames and the other on links between adjacent frames, and fusion layers that exchange static and dynamic features are interleaved among the network layers. On the large-scale NTU RGB+D dataset, the proposed architecture achieves 95.9% accuracy under single-data evaluation and 96.8% under multi-data evaluation. The experimental results confirm that the proposed method achieves better results.
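To make the two ideas in the abstract concrete, the following is a minimal PyTorch sketch, not the thesis's actual implementation: a graph convolution that mixes joint features through a normalized adjacency matrix, and a slow/fast pair of pathways that read the same skeleton sequence at two frame rates, with a time-strided lateral connection fusing fast-pathway features into the slow pathway. All module names, channel widths, the sampling ratio `alpha`, and the placeholder adjacency are illustrative assumptions.

```python
# Minimal sketch of a slow/fast skeleton GCN (hypothetical names and sizes).
import torch
import torch.nn as nn


class GraphConv(nn.Module):
    """Spatial graph convolution: mixes joint features along skeleton edges."""

    def __init__(self, in_channels, out_channels, adjacency):
        super().__init__()
        # Normalized adjacency (V x V) encoding the human-body structure.
        self.register_buffer("A", adjacency)
        self.proj = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        # x: (batch, channels, frames, joints)
        x = self.proj(x)
        # Aggregate each joint's neighbors: contract the joint axis with A.
        return torch.einsum("nctv,vw->nctw", x, self.A)


class SlowFastSkeleton(nn.Module):
    """Two pathways read the same skeleton sequence at different frame rates."""

    def __init__(self, adjacency, in_channels=3, alpha=4, num_classes=60):
        super().__init__()
        self.alpha = alpha  # fast pathway keeps alpha x more frames
        self.slow = GraphConv(in_channels, 64, adjacency)  # static features
        self.fast = GraphConv(in_channels, 16, adjacency)  # dynamic features
        # Lateral fusion: a time-strided conv maps fast features onto the
        # slow pathway's coarser temporal grid before concatenation.
        self.lateral = nn.Conv2d(16, 16, kernel_size=(alpha, 1), stride=(alpha, 1))
        self.head = nn.Linear(64 + 16, num_classes)

    def forward(self, x):
        # x: (batch, 3, frames, joints) -- a single skeleton data type.
        slow_feat = self.slow(x[:, :, ::self.alpha])  # low frame rate (strided)
        fast_feat = self.fast(x)                      # high frame rate (adjacent)
        fused = torch.cat([slow_feat, self.lateral(fast_feat)], dim=1)
        return self.head(fused.mean(dim=(2, 3)))      # global average pooling


# Toy usage: 25 joints (as in NTU RGB+D), 64 frames, placeholder adjacency.
V = 25
A = torch.eye(V) + torch.ones(V, V) / V  # stand-in for a normalized skeleton graph
model = SlowFastSkeleton(A)
logits = model(torch.randn(2, 3, 64, V))
print(logits.shape)  # torch.Size([2, 60])
```

In this sketch, `alpha` plays the role of the slow/fast sampling-rate ratio ablated in Section 4.1.2; a full model would stack many such blocks, place fusion layers at several depths, and refine the adjacency adaptively with attention rather than fixing it.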

Table of Contents:
Chapter 1 Introduction
  1.1 Research Background and Motivation
  1.2 Research Objectives
  1.3 Overview of Research Methods
  1.4 Research Contributions
  1.5 Thesis Organization
Chapter 2 Literature Review
  2.1 Action Recognition
  2.2 Graph Convolutional Networks
  2.3 Graph-Convolution-Based Skeleton Action Recognition
Chapter 3 Methodology
  3.1 Feature Extraction Unit
  3.2 Slow-Fast Structure
  3.3 Feature Fusion of the Fast and Slow Pathways
  3.4 Experimental Setup
  3.5 Skeleton Data Processing
Chapter 4 Experimental Results
  4.1 Ablation Studies
    4.1.1 Parameter Balance Between the Fast and Slow Pathways
    4.1.2 Sampling-Rate Ratio of the Fast and Slow Pathways
    4.1.3 Feature Fusion Layer Configuration
  4.2 Experiments on Large-Scale Datasets
  4.3 Training Details and Experimental Equipment
Chapter 5 Conclusion and Future Work
  5.1 Conclusion
  5.2 Future Work
References
Autobiography
Academic Achievements

