
Graduate student: 邱兆揚
Chao-Yang Chiu
Thesis title: 利用Google互聯網分類新聞語料之新詞自動擷取技術支援詞庫式中文斷詞系統
New Word Extraction Utilizing Google News Corpuses for Supporting Lexicon-based Chinese Word Segmentation Systems
Advisor: 洪欽銘
Hong, Chin-Ming
Degree: Master
Department: Department of Electrical Engineering
Year of publication: 2006
Academic year of graduation: 94 (ROC calendar)
Language: Chinese
Number of pages: 71
Chinese keywords: 中文斷詞 (Chinese word segmentation), 新詞擷取 (new word extraction), Google新聞服務 (Google News service), 詞庫 (lexicon)
English keywords: Natural language processing, New word extraction, Chinese word segmentation, Information retrieval
Thesis type: Academic thesis
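  • Chinese word segmentation has long been an active research topic, and many segmentation methods have been proposed. Lexicon-based segmentation was the earliest method to be used and remains the most common, but without a large and diverse lexicon its segmentation ability cannot be fully realized. This is especially true for contemporary Chinese text, which contains many new words absent from traditional lexicons, the so-called unknown words. When a traditional lexicon-based segmentation system processes such text, it often makes errors because it cannot recognize the new words, and its accuracy drops. An efficient Chinese new word extraction system is therefore necessary. This thesis proposes a method for automatically generating a lexicon. It uses the news service provided by Google, and the characteristics of that service, to build a specialized news lexicon whose contents are updated daily. Besides the extracted new words, the lexicon records information such as each new word's category and the time it appeared. This information can be used to improve the accuracy of the lexicon and supports further research. Because news covers a broad and diverse range of topics, the large volume of daily news yields Chinese words from many domains, which addresses the difficulty of expanding existing lexicons. Because of the nature of news data, the newest vocabulary in Chinese-speaking society can also be discovered and added to the lexicon in the shortest possible time.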

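    The experimental results confirm that the proposed method is feasible. Words from various domains were extracted from different news events, and examination by Chinese language experts showed that they include new words not covered by traditional lexicons, with reliable accuracy. This demonstrates that the method is indeed capable of automatic new word extraction.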
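    The unknown-word failure described in this abstract can be illustrated with a small sketch. It uses forward maximum matching, a common lexicon-based strategy; the record does not say which matching strategy the thesis uses, so the function `forward_max_match`, the `max_len` parameter, and the toy lexicon are all assumptions made for illustration.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy lexicon-based segmentation (illustrative, not the thesis's system).

    At each position, take the longest substring found in the lexicon;
    fall back to a single character when nothing matches.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a lexicon hit.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


# Toy lexicon (hypothetical entries); "部落格" (blog) is deliberately missing.
lexicon = {"我", "喜歡", "閱讀", "新聞"}
print(forward_max_match("我喜歡部落格新聞", lexicon))
# The unknown word is broken into single characters: 部 / 落 / 格

# Once the new word is added to the lexicon, it is segmented as one unit.
print(forward_max_match("我喜歡部落格新聞", lexicon | {"部落格"}))
```

    In this toy case the only difference between the two calls is one lexicon entry, which is why keeping the lexicon current matters for this family of methods.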

    Chinese word segmentation is an essential step in Chinese natural language processing because it benefits Chinese text mining and information retrieval. Currently, the lexicon-based scheme is the most widely used method; it segments Chinese sentences into distinct words for real-world applications. However, its word identification ability depends heavily on a well-prepared lexicon with enough lexical entries to cover the Chinese vocabulary. In particular, the scheme performs poorly on texts that change rapidly over time, such as newspaper articles and web documents, because such documents often contain new words that a segmentation system with a fixed lexicon cannot identify. Moreover, maintaining the lexicon manually is inefficient and time-consuming. To address these problems, this study proposes a novel statistics-based scheme for new word extraction that uses categorized Google News corpora, retrieved automatically from the Google News site, to improve the word identification ability of lexicon-based Chinese word segmentation systems. Compared with another proposed method, the experimental results indicate that the proposed scheme not only retrieves new words from the categorized Google News corpora more accurately but also obtains a larger number of new words.
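    The abstract calls the scheme statistics-based and the table of contents lists a section on entropy, but this record does not give the actual algorithm. As a rough sketch of how entropy can flag new-word candidates, the code below applies a generic neighbor-entropy (branching entropy) heuristic: a string that recurs and is surrounded by varied characters on both sides is more likely to be a free-standing word. The function names, thresholds, and sample sentences are assumptions for illustration and should not be read as the author's method.

```python
import math
from collections import Counter, defaultdict


def neighbor_entropy(counts):
    """Shannon entropy (in bits) of a neighbor-character distribution."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def extract_candidates(sentences, lexicon, max_len=4, min_freq=2, min_entropy=1.0):
    """Return n-grams outside the lexicon that are frequent and have
    varied left and right neighbors (hypothetical thresholds)."""
    freq = Counter()
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    for s in sentences:
        padded = "^" + s + "$"  # sentence boundaries count as neighbors
        for n in range(2, max_len + 1):
            for i in range(1, len(padded) - n):
                gram = padded[i:i + n]
                freq[gram] += 1
                left[gram][padded[i - 1]] += 1
                right[gram][padded[i + n]] += 1
    candidates = []
    for gram, count in freq.items():
        if count < min_freq or gram in lexicon:
            continue
        # Both sides must vary; a fragment such as "部落" is always
        # followed by "格", so its right entropy is zero and it is rejected.
        if min(neighbor_entropy(left[gram]), neighbor_entropy(right[gram])) >= min_entropy:
            candidates.append(gram)
    return sorted(candidates)


# Made-up sample sentences, not data from the thesis.
sentences = ["他的部落格很紅", "這個部落格更新了", "我看部落格文章"]
print(extract_candidates(sentences, lexicon=set()))
```

    Requiring high entropy on both sides is what separates a complete word from its fragments; a real system would also need much larger corpora and tuned thresholds.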

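    Chinese Abstract
    English Abstract
    Acknowledgments
    Table of Contents
    List of Figures
    List of Tables
    Chapter 1  Introduction
    1.1  Research Motivation and Background
    1.2  Research Objectives
    1.3  Research Methods
    1.4  Research Procedure
    Chapter 2  Research Content and Methods
    2.1  Chinese New Word Extraction
    2.2  Chinese Word Segmentation Systems and New Word Expansion
    2.3  The Google News Database
    2.4  Entropy Theory
    Chapter 3  Automatic New Word Extraction Method
    3.1  Automatic New Word Extraction Algorithm
    3.2  Use of the Existing Lexicon
    3.3  Eliminating Outdated and Erroneous Words
    Chapter 4  Experimental Analysis
    4.1  New Word Extraction
    4.2  Chinese New Word Lexicon
    4.3  Word Elimination
    4.4  New Word Classification
    4.5  Discussion of Experiments
    Chapter 5  Conclusion
    References
    Appendix 1
    Appendix 2
    Autobiography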

