Graduate student: | 胡夢珂 |
---|---|
Thesis title: | Using the Support Vector Machine to Classify Chinese Text Readability: A Case of Elementary Chinese Textbooks (使用支援向量機進行中文文本可讀性分類-以國小國語課文為例) |
Advisors: | 張國恩 Chang, Kuo-En; 宋曜廷 Sung, Yao-Ting; 張道行 Chang, Tao-Hsing |
Degree: | Master |
Department: | Graduate Institute of Information and Computer Education (資訊教育研究所) |
Year of publication: | 2011 |
Academic year of graduation: | 99 (ROC calendar, i.e. 2010-2011) |
Language: | Chinese |
Number of pages: | 75 |
Keywords (Chinese): | 可讀性、文本分類、支援向量機、中文斷詞 |
Keywords (English): | Readability, Text Classification, Support Vector Machine, Chinese Word Segmentation |
Thesis type: | Academic thesis |
Language ability plays an important role in many areas, and reading is one of the most important and direct ways to acquire it. Readability assesses whether a text suits a reader's reading ability. Previous studies describe readability formulas as tools for matching texts to readers of different educational levels. Readability research on English texts began long ago, but comparable research on Chinese is scarce, even though Chinese proficiency is increasingly important in today's society. A suitable method for classifying text readability is therefore important. Because of the technical limitations of their time, Western researchers mostly used linear readability formulas to classify texts, and linear formulas have limitations for the data in this study. This study therefore aims to build a predictive model trained with a support vector machine (SVM) to classify the readability of elementary school Chinese-language textbook lessons. It then checks whether the level predicted for each lesson matches the lesson's actual grade and analyzes the misclassified lessons in order to improve classification accuracy.
The experimental data come from three commercially published textbook series (versions H, K, and N) that were written by curriculum experts and approved by the national textbook review body. After lessons in the form of modern poetry, quatrains (jueju), classical prose, and regulated verse (lüshi) were removed, 386 Chinese-language lessons for grades one through six remained. Part of the lessons served as training data and the rest as test data. After Chinese word segmentation and data format conversion, an SVM classified the readability of the texts. Using LIBSVM to predict the textbook volume of each lesson yielded an accuracy of 47.92% and a fit rate of 80.31%. Finally, an error analysis of the misclassified lessons was carried out to identify the factors behind the prediction errors.
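The data format conversion and evaluation steps named in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation. It assumes that the lessons are already segmented into words, that the features are plain term counts (the abstract does not say which features were used), and that each label is a textbook volume number. The sparse line layout `<label> <index>:<value> ...` with ascending 1-based indices is the LIBSVM input format. The abstract does not define "fit rate"; `within_one_rate` is a hypothetical stand-in that counts predictions at most one level away from the true level, shown only to contrast a tolerant measure with exact-match accuracy. All function names and the toy data are illustrative.

```python
from collections import Counter


def build_vocab(docs):
    """Map each distinct token to a 1-based feature index."""
    vocab = {}
    for tokens in docs:
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab) + 1
    return vocab


def to_libsvm_line(label, tokens, vocab):
    """Format one document as '<label> <index>:<value> ...' (ascending indices)."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    feats = " ".join(f"{i}:{c}" for i, c in sorted(counts.items()))
    return f"{label} {feats}".rstrip()


def accuracy(y_true, y_pred):
    """Share of predictions that match the true level exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def within_one_rate(y_true, y_pred):
    """Share of predictions at most one level off (assumed, not the thesis's definition)."""
    return sum(abs(t - p) <= 1 for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy, already-segmented lessons: (volume label, tokens). Not thesis data.
lessons = [
    (1, "我 愛 媽媽".split()),
    (2, "媽媽 愛 我 我 愛 家".split()),
]
vocab = build_vocab(tokens for _, tokens in lessons)
lines = [to_libsvm_line(label, tokens, vocab) for label, tokens in lessons]
for line in lines:
    print(line)

# Toy predictions, only to show how the two measures can differ.
y_true = [1, 2, 3, 4]
y_pred = [1, 3, 3, 6]
print(accuracy(y_true, y_pred), within_one_rate(y_true, y_pred))
```

Lines in this layout can be written to a text file and passed to LIBSVM's training tool. In practice the vocabulary would be built from the training lessons only, so that words seen only in test lessons are dropped during conversion.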