Graduate Student: Luo, An-Chun (羅安鈞)
Thesis Title: Smart Lecture Recording System (智慧型演講錄製系統)
Advisor: Chen, Sei-Wang (陳世旺)
Degree: Doctoral
Department: Department of Computer Science and Information Engineering
Year of Publication: 2017
Academic Year of Graduation: 105
Language: English
Pages: 172
Chinese Keywords: 智慧型演講錄製系統、虛擬攝影師、虛擬導播、虛實對位、選鏡、視覺指導、虛實預覽
English Keywords: Smart lecture recording system, Virtual cameraman, Virtual director, Virtual-real match moving, Shot selection, Visual instruction, Preview images
DOI: https://doi.org/10.6345/NTNU202201891
Document Type: Academic thesis
In recent years, the growth of e-learning (or distance learning) has offered learners equal opportunities, from highly developed metropolises to remote, less-developed countries. Lecture recording systems play a vital role in collecting e-learning content. As e-learning flourishes, however, the shortage of digital content and of professional recording crews is becoming a serious problem. This study presents a smart lecture recording system that can automatically record content at the same level of quality as a human team, thereby alleviating the shortage of recording personnel.
The proposed smart lecture recording system consists of three principal components: the virtual cameraman, the virtual director, and virtual-real match moving. The first two, the virtual cameraman and the virtual director, run online, whereas virtual-real match moving is an offline post-production component. The virtual cameraman component is further divided into three subsystems: the speaker cameraman, the audience cameraman, and the lecture-hall cameraman. All of these subsystems operate automatically, selecting shooting targets, tracking them, and detecting special events. The videos captured by the three subsystems are all forwarded to the virtual director, which selects the most representative shot for recording or live broadcasting. We call this function of the virtual director shot selection. Shot selection analyzes the content of the videos coming from the virtual cameramen and makes its decisions through a machine-learning process based on a counter-propagation neural network. In addition, the virtual director has another key function, visual instruction, through which it imitates the communication between a human director and human cameramen in the real world.
After a live lecture recording is completed, additional content or material is sometimes added to the recorded footage to increase its expressiveness and watchability. This study therefore also developed a post-production component, called the virtual-real match moving system, for compositing virtual objects into the recorded video. The system uses a depth camera as a depth-sensing device to help align the real-world color camera with the virtual-world camera. The virtual-real match moving system has three main stages: temporal depth fusion, camera tracking, and virtual-real synthesis preview. The depth images acquired by the depth camera are fused over time into a 3D reconstruction of the scene. From the relation between the 3D scene structure and the depth camera, the trajectory of the color camera is derived. This trajectory then guides the virtual camera to move in step with the real camera, so that virtual objects can be projected to generate virtual images. The generated virtual images are superimposed on the real images acquired by the color camera; the resulting images are called virtual-real preview images.
A series of real lecture recording experiments was conducted. The experimental results show that the proposed smart lecture recording system can approximate the shooting and shot-selection techniques of a real human team. We also believe that the system need not be limited to lecture recording; with appropriate training data, it should also be suitable for recording stage performances, concerts, sports competitions, and product launches.
Nowadays, e-learning (or distance learning) provides equal opportunities for learners in locations ranging from highly developed metropolises to remote less-developed countries. Lecture recording systems play a vital role in collecting spoken discourse for e-learning. However, given the rapid growth of e-learning, the lack of content and of professional recording teams is becoming a problem. This research presents a smart lecture recording (SLR) system that can record orations at the same level of quality as a human team, but with a reduced degree of human involvement.
The proposed SLR system is composed of three principal components, referred to as the virtual cameraman (VC), the virtual director (VD), and virtual-real match moving (VRMM). The first two, VC and VD, run online, whereas the VRMM component is offline. The VC component is further divided into three subsystems: the speaker cameraman (SC), the audience cameraman (AC), and the hall cameraman (HC). All of these subsystems are automatic and can take actions that include target and event detection, tracking, and view searching. The videos taken by the three subsystems are all forwarded to the VD system, in which the most representative shot is chosen for recording or live broadcasting. We refer to this function of the VD system as shot selection. Shot selection operates on content analysis of the videos transmitted from the VC component; this content-analysis capability is pre-trained through a machine-learning process based on a counter-propagation neural network. In addition, the VD system provides another pivotal function, visual instruction, through which it imitates the communication between a human director and human cameramen in the real world.
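To make the shot-selection mechanism concrete, the sketch below implements a minimal counter-propagation network (a competitive Kohonen layer followed by a Grossberg outstar layer) that maps a feature vector summarizing the candidate shots to a preferred camera. This is only an illustrative sketch in Python; the feature encoding, network size, and training data shown here are hypothetical placeholders, not the configuration used in the thesis.

```python
import numpy as np

class CounterPropagationNet:
    """Minimal forward-only counter-propagation network: a competitive
    (Kohonen) hidden layer followed by a Grossberg outstar output layer."""

    def __init__(self, n_inputs, n_hidden, n_outputs, seed=0):
        rng = np.random.default_rng(seed)
        self.kohonen = rng.normal(size=(n_hidden, n_inputs))   # hidden prototypes
        self.grossberg = np.zeros((n_hidden, n_outputs))        # outstar weights

    def _winner(self, x):
        # Competitive layer: the hidden unit whose prototype is closest to x wins.
        return np.argmin(np.linalg.norm(self.kohonen - x, axis=1))

    def fit(self, X, Y, epochs=50, alpha=0.1, beta=0.1):
        for _ in range(epochs):
            for x, y in zip(X, Y):
                j = self._winner(x)
                # Move the winning prototype toward the input ...
                self.kohonen[j] += alpha * (x - self.kohonen[j])
                # ... and the winner's outstar weights toward the target.
                self.grossberg[j] += beta * (y - self.grossberg[j])

    def predict(self, x):
        return self.grossberg[self._winner(x)]

# Hypothetical usage: each sample is a feature vector describing the three
# candidate shots (e.g. speaker motion, audience activity, slide changes), and
# the target is a one-hot vector naming the shot a human director would choose.
X = np.random.rand(200, 9)                        # placeholder features
Y = np.eye(3)[np.random.randint(0, 3, 200)]       # placeholder director choices
net = CounterPropagationNet(n_inputs=9, n_hidden=20, n_outputs=3)
net.fit(X, Y)
chosen_camera = int(np.argmax(net.predict(X[0])))  # 0=speaker, 1=audience, 2=hall
```

In the actual system the inputs would come from the VD's content analysis of the VC streams, and the targets from shots chosen by a human director during training; the placeholders above only show the data flow.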
After a live speech recording has been completed, it is often desirable to include additional content or materials in the shot collection of the speech in order to increase its expressivity and vitality. In this context, we developed a post-production component, the virtual-real match moving (VRMM) system, for graphic/stereoscopic image composition. The input to this system is provided by a rig consisting of a color camera and a depth camera. Three major processes are involved in the VRMM component: temporal depth fusion, camera tracking, and virtual-real synthesis preview. During temporal depth fusion, the depth images acquired by the depth camera are fused into a 3D reconstruction of the scene. Based on the reconstructed scene, the pose of the color camera is determined; this pose is then used to direct a virtual camera to generate synthetic images of a given 3D object model. The generated images are superimposed on the real images acquired by the color camera, and the resulting images are called preview images.
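As a rough illustration of the final synthesis step, the sketch below projects a point-sampled virtual object through a pinhole camera whose pose stands in for the estimated pose of the color camera, and overlays the projection on a real frame to form a preview image. All variable names and the placeholder data are hypothetical, and occlusion handling, which a full VRMM pipeline would need, is omitted.

```python
import numpy as np

def project_points(points_3d, K, R, t):
    """Project 3D world points into the image of a pinhole camera with
    intrinsics K and pose (R, t)."""
    cam = (R @ points_3d.T + t.reshape(3, 1)).T      # world -> camera coordinates
    in_front = cam[:, 2] > 1e-6                       # keep points in front of camera
    uvw = (K @ cam[in_front].T).T
    return uvw[:, :2] / uvw[:, 2:3], in_front

def compose_preview(real_frame, points_3d, color, K, R, t):
    """Superimpose a virtual object (a 3D point set) on a real frame, producing
    the kind of virtual-real preview image described above."""
    preview = real_frame.copy()
    h, w = preview.shape[:2]
    pix, _ = project_points(points_3d, K, R, t)
    for u, v in pix.astype(int):
        if 0 <= v < h and 0 <= u < w:
            preview[v, u] = color                     # paint the projected point
    return preview

# Hypothetical usage with placeholder data: K, R, t would come from the
# camera-tracking stage, and points_3d from the virtual object model.
K = np.array([[800, 0, 320], [0, 800, 240], [0, 0, 1]], float)
R, t = np.eye(3), np.array([0.0, 0.0, 2.0])
cube = np.random.uniform(-0.1, 0.1, size=(500, 3))    # a small virtual "object"
frame = np.zeros((480, 640, 3), dtype=np.uint8)       # stand-in for a real image
out = compose_preview(frame, cube, color=(0, 255, 0), K=K, R=R, t=t)
```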
A series of experiments on real lectures was conducted. The results show that the proposed SLR system can provide oration records close, to some extent, to those taken by real human teams. We believe that the proposed system need not be limited to live speeches; if configured with appropriate training materials, it may also be suitable for recording stage performances, concerts, athletic competitions, and product launches.