Graduate student: 游斯涵
Thesis title: Exploiting Machine Learning Methods for Spoken Document Retrieval (original Chinese title: 使用機器學習方法於語音文件檢索之研究)
Advisor: Chen, Berlin (陳柏琳)
Degree: Master
Department: Department of Computer Science and Information Engineering
Year of publication: 2009
Academic year of graduation: 97 (ROC calendar)
Language: Chinese
Number of pages: 134
Keywords: Information Retrieval, Learning to Rank, Speech Recognition
Thesis type: Academic thesis
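  • This thesis gives a preliminary discussion of applying machine learning methods to information retrieval, an approach known as learning to rank. It analyzes and experiments with the machine learning models and concepts that have been applied to information retrieval in recent years, together with the features they use, including lexical features, proximity features, and probabilistic features. The thesis also extends these methods to spoken document retrieval. The experiments use the Chinese portion of the TDT (Topic Detection and Tracking) corpora, one of the standard benchmark collections used at TREC (Text REtrieval Conference) for the public evaluation of spoken document retrieval systems. The collection comprises two corpora, TDT-2 and TDT-3, and provides a large amount of news material with rich topic and transcription annotations for research on spoken document retrieval. To develop more informative spoken document features, the thesis also uses the National Taiwan Normal University (NTNU) large vocabulary continuous speech recognition (LVCSR) system for mainland-accented Mandarin as the transcription platform; the word graphs it produces serve as the main basis for extracting features unique to spoken documents. In addition, we consider the problem of unbalanced training data in information retrieval and propose a strategy to address it. Finally, preliminary experimental results show that the retrieval model trained with the pairwise method RankNet performs better than the model trained with the pointwise method SVM.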

    This thesis investigates the use of machine-learning approaches, namely learning-to-rank algorithms, for information retrieval (IR), with special emphasis on their theoretical foundations and the features they use, such as lexical features, proximity features, and probabilistic features. We also consider the application of these approaches to spoken document retrieval (SDR). All experiments were conducted on the Topic Detection and Tracking corpora (specifically TDT-2 and TDT-3), which are benchmark collections widely adopted for SDR evaluations because they contain tens of hours of mainland-accented Chinese broadcast news documents equipped with topic labels and orthographic transcripts. To discover more useful speech-related features for SDR and to analyze the problems caused by speech recognition errors, we built a large vocabulary continuous speech recognition (LVCSR) system that outputs a word lattice consisting of multiple recognition hypotheses for each broadcast news document. Moreover, we address the problem of training machine-learning retrieval models with unbalanced training data and propose a remedy for it. Finally, the preliminary experimental results suggest that the RankNet-based retrieval model outperforms the support vector machine (SVM) based retrieval model on the SDR task studied in this thesis.
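
The contrast between pointwise and pairwise training mentioned in the abstract can be illustrated with a minimal sketch of the RankNet pairwise cross-entropy cost. This is not the thesis's implementation: it assumes a linear scoring function in place of a neural network, and the function names, learning rate, epoch count, and feature vectors are hypothetical choices made only for illustration.

```python
import math


def ranknet_pair_loss(s_i, s_j):
    """Pairwise cost when document i should rank above document j.

    P(i ranks above j) is modeled as sigmoid(s_i - s_j); the cost is
    -log of that probability, i.e. log(1 + exp(-(s_i - s_j))).
    """
    diff = s_i - s_j
    # Numerically stable form of log(1 + exp(-diff)).
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


def score(w, x):
    """Linear scorer (an assumption; RankNet itself uses a neural network)."""
    return sum(wk * xk for wk, xk in zip(w, x))


def train_pairwise(pairs, dim, lr=0.1, epochs=200):
    """Gradient descent on the pairwise cost.

    pairs: list of (x_relevant, x_nonrelevant) feature vectors for one query.
    """
    w = [0.0] * dim
    for _ in range(epochs):
        for x_pos, x_neg in pairs:
            diff = score(w, x_pos) - score(w, x_neg)
            # Derivative of log(1 + exp(-diff)) with respect to diff.
            g = -1.0 / (1.0 + math.exp(diff))
            for k in range(dim):
                w[k] -= lr * g * (x_pos[k] - x_neg[k])
    return w
```

A pointwise method such as an SVM classifier scores each query-document example independently, whereas the cost above depends only on the score difference within a pair, so it directly penalizes a relevant document being ranked below a non-relevant one.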

    1. Introduction
        1.1 Research Background
        1.2 Applications of Information Retrieval to Multiple Types of Information
        1.3 Introduction to Research on Spoken Document Search
        1.4 Content and Contributions of This Thesis
        1.5 Organization of This Thesis
    2. Literature Review
        2.1 Learning to Rank
            2.1.1 Point-wise Training
            2.1.2 Pair-wise Training
            2.1.3 List-wise Training
        2.2 Support Vector Machine
    3. Information Retrieval Framework and Problem Statement
        3.1 Learning to Rank Methods in Information Retrieval
        3.2 Evaluation Tools
        3.3 Experimental Corpora
        3.4 Feature Selection
            3.4.1 Low-level Features
            3.4.2 Proximity Features
            3.4.3 Probabilistic Features
        3.5 Support Vector Machine Toolkit, Parameter Selection, and Normalization Steps
        3.6 Support Vector Machine Experiments in Information Retrieval
            3.6.1 Preliminary Experimental Results
            3.6.2 Discussion of Problems
    4. Improvement Strategies
        4.1 Pair-wise Training: RankNet
        4.2 Strategies for the Unbalanced Training Data Problem
            4.2.1 Increasing the Number of Positive Training Examples (Up-Sampling)
            4.2.2 Reducing the Number of Negative Training Examples (Down-Sampling)
            4.2.3 Procedure of the Update Method
    5. Spoken Document Retrieval
        5.1 Dragon Large Vocabulary Speech Recognizer
        5.2 NTNU Mainland-Accented Mandarin Large Vocabulary Continuous Speech Recognition System
            5.2.1 Front-end Processing
            5.2.2 Acoustic Model
            5.2.3 Lexicon Construction
            5.2.4 Tree-copy Search
        5.3 Spoken Document Retrieval Procedure
        5.4 Retrieval Performance of Individual Features on Spoken Documents
    6. Experimental Results and Discussion
        6.1 Point-wise Training for Spoken Document Retrieval
            6.1.1 Retrieval Performance of SVM on Spoken Documents Transcribed by the Dragon Speech Recognizer
            6.1.2 Retrieval Performance of SVM on Spoken Documents Transcribed by the NTNU Mainland-Accented Mandarin Large Vocabulary Speech Recognizer
        6.2 Pair-wise Training for Spoken Document Retrieval
            6.2.1 Retrieval Performance of RankNet on Correct Speech Transcripts
            6.2.2 Retrieval Performance of RankNet on Spoken Documents Transcribed by the Dragon Recognizer
            6.2.3 Retrieval Performance of RankNet on Spoken Documents Transcribed by the NTNU Mainland-Accented Mandarin Large Vocabulary Speech Recognizer
        6.3 Relationship Between Pair-wise Training and Average Precision
        6.4 Experiments on Using the Update Method to Address the Unbalanced Data Problem
    7. Conclusion
    8. Future Work
    9. References

