
Author: Huang, Yao-Min (黃耀民)
Title: Research on Sentence Extraction-based Automatic Summarization Applied to Document Classification (以字句擷取為基礎並應用於文件分類之自動摘要之研究)
Advisors: Yeh, Yao-Ming (葉耀明); Chen, Berlin (陳柏琳)
Degree: Master
Department: Department of Computer Science and Information Engineering
Year of Publication: 2005
Academic Year of Graduation: 93 (ROC calendar, 2004-2005)
Language: Chinese
Pages: 78
Keywords: Summarization, Latent Semantic Analysis, Hidden Markov Model, Topical Mixture Model, K-Nearest-Neighbor Classifier
Document Type: Academic thesis

The purpose of extractive summarization is to automatically select a number of salient sentences, passages, or paragraphs from the original document according to a target summarization ratio and then sequence them to form a concise summary. Most conventional summarization models can in principle be characterized by one of two matching strategies. One measures the relevance between a sentence and the document by literal term matching, as exemplified by the Vector Space Model (VSM); the other measures such relevance by concept matching, as exemplified by Latent Semantic Analysis (LSA).
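To make the literal-term-matching strategy concrete, the following is a minimal sketch of VSM-style sentence selection, assuming plain term-frequency vectors and an illustrative 30% summarization ratio (neither is necessarily the thesis's exact weighting scheme or setting):

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(c * v.get(t, 0) for t, c in u.items())
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def vsm_summarize(sentences, ratio=0.3):
    """Score each sentence (a list of tokens) against the whole document
    and return the indices of the top-scoring ones in document order."""
    doc_vec = Counter(t for s in sentences for t in s)
    ranked = sorted(range(len(sentences)),
                    key=lambda i: cosine(Counter(sentences[i]), doc_vec),
                    reverse=True)
    n = max(1, round(len(sentences) * ratio))
    return sorted(ranked[:n])  # restore original document order
```

Concept matching replaces this raw term space with a latent one: LSA applies a truncated singular value decomposition to the term-sentence matrix and measures the same kind of sentence-document proximity there.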
Based on these observations, in this study we propose several improved approaches to automatic document summarization. For literal term matching, the Hidden Markov Model (HMM) is investigated, and two of its variants (HMM-Type1 and HMM-Type2) are studied extensively. In HMM-Type1, the document is viewed as a generative model that defines a probability distribution over the indexing terms. The relevance between the document and each of its sentences is measured by the product of the likelihoods of all the indexing terms of the sentence being generated by the document model; in other words, sentences with higher likelihood scores are more likely to be relevant to the document. In HMM-Type2, each sentence in the document is instead viewed as a probabilistic generative model. The relevance between the document and a given sentence is measured by the likelihood that the document is generated by the sentence, and the sentences can be ranked according to their respective likelihood scores. On the other hand, for concept matching, two summarization models are proposed: embedded Latent Semantic Analysis (embedded LSA) and the Topical Mixture Model (TMM). In embedded LSA, the document to be summarized also participates in the construction of the latent semantic space, and the importance of each sentence is then measured by the proximity of its vector representation to that of the document in that space. In the TMM, each sentence in the document is taken as a mixture model consisting of K latent topical distributions and their corresponding document-specific posterior probabilities; the relevance between each sentence and the document is then measured by the likelihoods of the indexing terms of the document being observed in the latent topics and the probabilities that the sentence generates the respective topics. A series of experiments was conducted on a Mandarin broadcast news speech corpus. The results show that either the HMM or the TMM performs significantly better than the conventional approaches, and that the TMM further outperforms the HMM in almost all conditions.
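As an illustration of the generative view behind HMM-Type1, the sketch below scores each sentence by the log-likelihood of its terms under a document unigram model interpolated with a background corpus model. The two-state mixture and the weight `alpha` are assumptions made for the example, not the thesis's exact topology or parameterization; HMM-Type2 would reverse the roles, estimating a model from each sentence and scoring the document against it.

```python
import math
from collections import Counter

def unigram(tokens):
    """Maximum-likelihood unigram model estimated from a token sequence."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def hmm_type1_score(sentence, doc_model, corpus_model, alpha=0.5):
    """Log-likelihood that the document model generates the sentence.
    Each term is drawn either from the document state or from a
    background corpus state; alpha is an assumed mixture weight."""
    score = 0.0
    for t in sentence:
        p = alpha * doc_model.get(t, 0.0) + (1 - alpha) * corpus_model.get(t, 0.0)
        score += math.log(p) if p > 0 else float("-inf")
    return score

def rank_sentences(sentences, corpus_model, alpha=0.5):
    """Rank a document's sentences by their HMM-Type1 relevance scores."""
    doc_model = unigram(t for s in sentences for t in s)
    scores = [hmm_type1_score(s, doc_model, corpus_model, alpha) for s in sentences]
    return sorted(range(len(sentences)), key=scores.__getitem__, reverse=True)
```

The background model keeps unseen terms from zeroing out a sentence's score, which is why the interpolation matters more than its exact weight in practice.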
Finally, we also study the adaptation of the TMM summarization model to document classification, for which the documents may first be preprocessed by the summarization models mentioned above. Initial experimental results show that the TMM classifier performs slightly better than the conventional K-Nearest-Neighbor (KNN) classifier.
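A minimal sketch of how a trained TMM might serve as a classifier, consistent with the mixture formulation above: a document is assigned to the class whose K topical distributions, weighted by that class's topic posteriors, give the document the highest likelihood. The containers `topic_word_probs` and `class_topic_weights_by_label` are hypothetical names for illustration, and the EM training that estimates them is not reproduced here.

```python
import math

def tmm_log_likelihood(doc_tokens, topic_word_probs, class_topic_weights):
    """Log-likelihood of a document under one class's topical mixture:
    P(w | class) = sum_k P(w | T_k) * P(T_k | class)."""
    ll = 0.0
    for w in doc_tokens:
        p = sum(topic_word_probs[k].get(w, 0.0) * class_topic_weights[k]
                for k in range(len(class_topic_weights)))
        ll += math.log(p) if p > 0 else float("-inf")
    return ll

def tmm_classify(doc_tokens, topic_word_probs, class_topic_weights_by_label):
    """Assign the label whose mixture gives the document the highest likelihood."""
    return max(class_topic_weights_by_label,
               key=lambda lab: tmm_log_likelihood(
                   doc_tokens, topic_word_probs, class_topic_weights_by_label[lab]))
```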

Keywords: Summarization, Latent Semantic Analysis, Hidden Markov Model, Topical Mixture Model, K-Nearest-Neighbor Classifier

Table of Contents

List of Tables
List of Figures
Chapter 1 Introduction
  1.1 Motivation and Objectives
  1.2 Automatic Summarization
  1.3 Research Contributions
  1.4 Organization of the Thesis
Chapter 2 Related Work
  2.1 Vector Space Model (VSM)
  2.2 Relevance Measure (RM)
  2.3 Latent Semantic Analysis (LSA)
    2.3.1 Term-Sentence Matrix
    2.3.2 Singular Value Decomposition (SVD)
    2.3.3 LSA Summarization Model
  2.4 Markov Model
  2.5 Hidden Markov Model (HMM)
  2.6 Statistical Language Model (SLM)
  2.7 Topical Mixture Model (TMM)
    2.7.1 TMM Training
      2.7.1.1 Supervised TMM Training
      2.7.1.2 Unsupervised TMM Training
Chapter 3 Automatic Summarization Models
  3.1 Embedded LSA Model
  3.2 HMM-Type1
  3.3 HMM-Type2
  3.4 Topical Mixture Model (TMM)
Chapter 4 Experimental Corpus and Experiments
  4.1 Experimental Corpus
  4.2 Word Segmentation and Statistical Language Models
  4.3 Evaluation of Automatic Summarization
    4.3.1 Cosine Measure
    4.3.2 ROUGE Measure
    4.3.3 Average Precision Measure
  4.4 Experimental Results
    4.4.1 Cosine Measure
    4.4.2 ROUGE Measure
    4.4.3 Average Precision Measure
  4.5 Overall Comparison
  4.6 Further Experiments on the HMM and TMM
  4.7 Chapter Summary
Chapter 5 Application of Automatic Summarization to Document Classification
  5.1 Classification and Clustering
  5.2 Feature Extraction
  5.3 Classifiers
    5.3.1 Vector Space Model (VSM)
    5.3.2 Naïve Bayes (NB) Model
    5.3.3 K-Nearest-Neighbor (KNN)
    5.3.4 Comparison of Classifiers
  5.4 TMM Classifier
  5.5 Experimental Setup
  5.6 Experimental Results
  5.7 Chapter Summary
Chapter 6 Conclusions and Future Work
  6.1 Conclusions
  6.2 Future Work
References

