
Graduate student: 胡夢珂
Thesis title: 使用支援向量機進行中文文本可讀性分類-以國小國語課文為例
Using the Support Vector Machine to classify the Chinese text readability – A Case of Elementary Chinese Textbook
Advisors: 張國恩
Chang, Kuo-En
宋曜廷
Sung, Yao-Ting
張道行
Chang, Tao-Hsing
Degree: Master
Department: Graduate Institute of Information and Computer Education
Year of publication: 2011
Academic year of graduation: 99 (ROC calendar)
Language: Chinese
Number of pages: 75
Chinese keywords: 可讀性、文本分類、支援向量機、中文斷詞
English keywords: Readability, Text Classification, Support Vector Machine, Chinese Word Segmentation
Thesis type: Academic thesis
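  • Language ability plays an important role in every domain, and reading is one of the most important and most direct ways to acquire it. Readability assesses whether a text suits a reader's reading ability. Previous studies indicate that a readability formula is a tool for adjusting texts to readers of different educational levels. Readability research on English texts began early, but comparable research on Chinese is scarce, even though Chinese proficiency is an increasingly important trend in today's society. A classification method suited to text readability is therefore important. Because of the technical limitations of the time, Western scholars mostly used linear readability formulas to classify texts by readability, and linear formulas impose some limitations on the data in this study. The purpose of this study is therefore to build a predictive model, trained with a Support Vector Machine (SVM), that classifies the readability of elementary school Mandarin (國語) textbook lessons. The study then checks whether the predicted grade of each lesson matches its actual grade and analyzes the misclassified lessons in order to improve classification accuracy.
    The experimental data are 386 lessons from the grade 1 to grade 6 Mandarin textbooks of three commercially published editions (H, K, and N), which were compiled by curriculum experts and approved by the national review authority. Lessons that are modern poems, quatrains (絕句), classical prose, or regulated verse (律詩) were excluded before counting. One part of the lessons serves as training data and the other part as test data. After Chinese word segmentation and data format conversion, an SVM classifies the readability of the texts. The results show that, when LIBSVM is used to predict the textbook volume (冊別) of each lesson, the accuracy is 47.92% and the fit rate is 80.31%. Finally, an error analysis of the misclassified lessons examines which factors caused the prediction errors.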

    Language ability plays an important part in every field, and one of the most effective ways to improve it is to read. Readability estimates whether a text is suitable for a given reader. Past research claims that a readability formula is a means of adjusting the level of a text to readers with different educational attainment. Research on English readability has long been under way, while research on Chinese has made little progress, even though Chinese proficiency is a growing trend nowadays. It is therefore important to find a suitable way to classify text readability.
    In past research, owing to the lack of technology, many Western studies used linear readability formulas for text classification, and linear readability formulas are limiting for the data in this research. Therefore, the purpose of this research is to use a predictive model trained by a support vector machine (SVM) to classify the readability of elementary Chinese textbook texts, to check whether the predicted grade of each text matches its actual grade, and finally to analyze the misclassified texts in order to improve the accuracy of readability classification.
    The experimental materials are 386 texts from the first- to sixth-grade Chinese textbooks of three private publishers' versions (H, K, and N), compiled by curriculum experts and approved by the national compilation organization, with the classical Chinese texts deleted. Part of the texts are used as training materials and the others as testing materials. After Chinese word segmentation and data format conversion, the texts are classified by SVM. The result is that the accuracy of predicting elementary texts is 47.92%, while the fit rate is 80.31%. Finally, the wrong predictions are analyzed to understand their causes.
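    The abstract mentions a data format conversion step before SVM training and reports an accuracy figure. A minimal sketch of those two steps follows, assuming each text has already been reduced to a list of numeric feature values. The helper names, feature values, and labels are hypothetical and are not taken from the thesis. The line layout is the sparse LIBSVM input format (`<label> <index>:<value> ...`, with 1-based ascending indices). This record does not define the fit rate, so the sketch covers only accuracy.

```python
from typing import Sequence


def to_libsvm_line(label: int, features: Sequence[float]) -> str:
    """Format one text as a LIBSVM sparse line: '<label> <index>:<value> ...'.

    Indices are 1-based and ascending; zero-valued features are omitted.
    """
    pairs = [f"{i}:{v}" for i, v in enumerate(features, start=1) if v != 0]
    return " ".join([str(label), *pairs])


def accuracy(actual: Sequence[int], predicted: Sequence[int]) -> float:
    """Share of texts whose predicted level equals the actual level."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    hits = sum(a == p for a, p in zip(actual, predicted))
    return hits / len(actual)


# Hypothetical data: (volume label, [feature values]); not from the thesis.
samples = [(1, [12.0, 0.0, 3.5]), (7, [48.0, 2.0, 0.0])]
for label, feats in samples:
    print(to_libsvm_line(label, feats))

# Hypothetical actual and predicted volumes for four texts.
print(accuracy([1, 2, 3, 4], [1, 2, 4, 4]))
```

    Lines produced this way could be written to a file and passed to the LIBSVM command-line tools for training and testing. Which features and kernel the thesis actually used is not stated in this record.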

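    Table of Contents
    List of Tables
    List of Figures
    Chapter 1 Introduction
      Section 1 Research Background
      Section 2 Research Purpose
    Chapter 2 Literature Review
      Section 1 Readability
      Section 2 Coh-Metrix
      Section 3 Chinese Word Segmentation
      Section 4 Support Vector Machine
    Chapter 3 System Design
      Section 1 System Architecture
      Section 2 Chinese Readability Index Analysis System
      Section 3 Training and Testing the Support Vector Machine
    Chapter 4 Experimental Design
      Section 1 Experimental Tools
      Section 2 Experimental Data
      Section 3 Experimental Procedure
      Section 4 Experimental Results
      Section 5 Discussion of Experimental Results
    Chapter 5 Conclusion and Future Work
      Section 1 Conclusion
      Section 2 Future Work
    References
    Appendix 1 Coh-Metrix 2.0 Readability Indices
    Appendix 2 Academia Sinica Balanced Corpus Part-of-Speech Tag Set
    Appendix 3 Lessons Removed from the Elementary Mandarin Textbooks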

