Graduate Student: 廖邵瑋 Liao, Shao-Wei
Thesis Title: 自然語言處理技術應用於科技華語詞彙分析 (A Vocabulary Analysis of Chinese for Science and Technology with Natural Language Processing Methods)
Advisor: 洪嘉馡 Hong, Jia-Fei
Oral Defense Committee: 謝佳玲 Hsieh, Chia-Ling; 曾厚強 Tseng, Hou-Chiang; 洪嘉馡 Hong, Jia-Fei
Oral Defense Date: 2022/06/23
Degree: Master
Department: 華語文教學系 Department of Chinese as a Second Language
Publication Year: 2022
Graduation Academic Year: 110 (ROC calendar)
Language: Chinese
Pages: 172
Chinese Keywords: 科技華語, 自然語言處理, LDA, Word2Vec
English Keywords: Chinese for Science and Technology, Natural Language Processing, LDA, Word2Vec
Research Method: Content analysis
DOI URL: http://doi.org/10.6345/NTNU202200683
Document Type: Academic thesis
Abstract:
In this study, an LDA topic model and a Word2Vec word-vector model were trained on 11,067 texts from the popular-science website PanSci (《泛科學》), with the aim of supporting vocabulary instruction in Chinese for science and technology (CST).
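The two training steps can be illustrated with a minimal sketch. The Python code below is not the thesis's actual pipeline: it assumes the gensim library (cited below via Rehurek & Sojka, 2010), a hypothetical file pansci_tokens.txt holding the word-segmented PanSci articles one per line, and illustrative hyperparameters.

```python
# Minimal sketch (not the thesis's actual code) of the two training steps
# described in the abstract, using gensim. "pansci_tokens.txt" is a
# hypothetical file holding the 11,067 word-segmented PanSci articles,
# one article per line; all hyperparameters are illustrative assumptions.
from gensim.corpora import Dictionary
from gensim.models import LdaModel, Word2Vec

# Each line is one article, already segmented into words (e.g., with CKIP).
with open("pansci_tokens.txt", encoding="utf-8") as f:
    docs = [line.split() for line in f]

# LDA topic model: bag-of-words corpus over a pruned vocabulary;
# the thesis ultimately settles on nine topics.
dictionary = Dictionary(docs)
dictionary.filter_extremes(no_below=5, no_above=0.5)  # assumed pruning
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]
lda = LdaModel(bow_corpus, id2word=dictionary, num_topics=9,
               passes=10, random_state=42)

# Word2Vec word-vector model trained on the same segmented texts
# (sg=1 selects skip-gram; this choice is an assumption).
w2v = Word2Vec(sentences=docs, vector_size=300, window=5,
               min_count=5, sg=1, workers=4)
```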
With the interaction of internationalization and technological development, the number of foreign learners coming to Taiwan to study science-related subjects is increasing. To meet their needs in taking professional courses and in academic communication with peers, the CST curriculum must bridge the gap between Chinese for general purposes (CGP) and the academic Chinese of science and technology. However, language-related research on this specialized domain is still scarce, leaving two major problems in CST curricula and materials: the inability to select vocabulary appropriate to learners' different disciplines for teaching, and the lack of analysis of how words are used in scientific text contexts. This study therefore focuses on the following objectives: first, to delimit the range of CST vocabulary for different subject areas and propose a reference wordlist; second, to analyze how the co-occurring words of CST vocabulary differ between CGP and CST contexts; and third, to compare the usage contexts and co-occurring words of CST near-synonyms.
First, based on the results of the LDA topic model, we found nine latent science and technology topics in the popular-science texts: "food science, nutrition", "biology, life science", "medicine, pharmacy, public health", "academic life", "information and communication technology, electrical and electronic engineering", "earth science, environmental science", "astronomy, aerospace engineering", "physics, chemistry, materials science", and "neuropsychology, statistics". Next, the words associated with each topic were graded for difficulty with the National Academy for Educational Research's word-grading system, and a recommended wordlist was built for each CST subject area. Finally, taking the CST words in these lists as examples, we applied the Word2Vec model to calculate semantic similarity between words, compared how CST words are used in CGP versus CST contexts, and analyzed CST near-synonyms, so as to serve as a reference for CST vocabulary teaching.
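To make these analysis steps concrete, here is a hedged continuation of the sketch above. It assumes the lda and w2v models from the previous block, plus a hypothetical second Word2Vec model w2v_general trained on a general-purpose (CGP) corpus; the example words are illustrative, not the thesis's actual items, and the NAER difficulty grading itself is done with the online word-grading system rather than in code.

```python
# Hedged continuation of the sketch above, illustrating the analysis steps.
# "w2v_general" is a hypothetical second Word2Vec model trained on a
# general-purpose (CGP) corpus; the example words are illustrative only.

# 1) Candidate wordlist per topic: the top-weighted words of each of the
#    nine LDA topics, to be graded for difficulty afterwards.
for topic_id in range(lda.num_topics):
    top_words = [word for word, _ in lda.show_topic(topic_id, topn=20)]
    print(topic_id, top_words)

# 2) Context comparison: nearest neighbours (highest cosine similarity)
#    of the same CST word in the scientific vs. the general model.
target = "病毒"  # "virus", an illustrative CST item
print(w2v.wv.most_similar(target, topn=10))          # CST context (PanSci)
print(w2v_general.wv.most_similar(target, topn=10))  # CGP context

# 3) Near-synonym analysis: cosine similarity between a candidate pair.
print(w2v.wv.similarity("疾病", "病症"))  # illustrative near-synonyms
```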
English References
Alexander, R. J. (1984). Fixed Expressions in English: Reference Books and the Teacher. English Language Teaching Journal, 38(2), 127-134. https://doi.org/10.1093/elt/38.2.127
Baroni, M., Dinu, G., & Kruszewski, G. (2014). Don’t count, predict! A Systematic Comparison of Context-counting vs. Context-predicting Semantic Vectors. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Maryland, United States. https://doi.org/10.3115/v1/P14-1
Berners-Lee, T., Hendler, J., & Lassila, O. (2001). The Semantic Web. Scientific American, 284(5), 34-43.
Blei, D. M., & Lafferty, J. D. (2006). Correlated Topic Models. Advances in Neural Information Processing Systems, 18, 147.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993-1022.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., & Askell, A. (2020). Language Models are Few-shot Learners. Advances in Neural Information Processing Systems, 33, 1877-1901.
Cavnar, W. (1994). Using an N-gram-based Document Representation with a Vector Processing Retrieval Model. Proceedings of the third Text Retrieval Conference. Gaithersburg, Maryland, United States. https://trec.nist.gov/pubs/trec3/papers/cavnar_ngram_94.ps.gz
Chomsky, N. (1957). Syntactic Structures. Mouton, The Hague.
Chomsky, N. (1965). Aspects of the Theory of Syntax. MIT Press.
Das, R., Zaheer, M., & Dyer, C. (2015). Gaussian LDA for Topic Models with Word Embeddings. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing. Beijing, China. https://doi.org/10.3115/v1/P15-1
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41(6), 391-407.
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805.
Dieng, A. B., Wang, C., Gao, J. F., & Paisley, J. (2016). TopicRNN: A Recurrent Neural Network with Long-Range Semantic Dependency. Proceedings of the 5th International Conference on Learning Representations. arXiv preprint. https://doi.org/10.48550/arXiv.1611.01702
Firth, J. R. (1957). A Synopsis of Linguistic Theory, 1930-1955. In Studies in Linguistic Analysis (pp. 1-32). Basil Blackwell.
Gao, J., He, D., Tan, X., Qin, T., Wang, L., & Liu, T. Y. (2019). Representation Degeneration Problem in Training Natural Language Generation Models. arXiv preprint arXiv:1907.12009.
Griffiths, T. L., & Steyvers, M. (2004). Finding Scientific Topics. Proceedings of the National Academy of Sciences, 101, 5228-5235. https://doi.org/10.1073/pnas.0307752101
Halliday, M. A. K., & Hasan, R. (1976). Cohesion in English. Routledge. https://doi.org/10.4324/9781315836010
Hasan, M., Rahman, A., Karim, M. R., Khan, M. S. I., & Islam, M. J. (2021). Normalized Approach to Find Optimal Number of Topics in Latent Dirichlet Allocation (LDA). In M. S. Kaiser, A. Bandyopadhyay, M. Mahmud, & K. Ray (Eds.), Proceedings of International Conference on Trends in Computational and Cognitive Engineering. Advances in Intelligent Systems and Computing, vol. 1309, pp. 341-354. Springer, Singapore. https://doi.org/10.1007/978-981-33-4673-4_27
Hausmann, F. J. (1989). Le Dictionnaire de Collocations. In Wörterbücher, Dictionaries, Dictionnaires: Ein Internationales Handbuch (pp. 1010-1019). de Gruyter, Berlin.
Hofmann, T. (1999). Probabilistic Latent Semantic Indexing. Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. California, United States. https://doi.org/10.1145/312624.312649
Hofmann, T. (2001). Unsupervised Learning by Probabilistic Latent Semantic Analysis. Machine Learning, 42(1), 177-196.
Holmes, I., Harris, K., & Quince, C. (2012). Dirichlet Multinomial Mixtures: Generative Models for Microbial Metagenomics. PLoS ONE, 7(2), e30126. https://doi.org/10.1371/journal.pone.0030126
Hu, W., & Tsujii, J. I. (2016). A Latent Concept Topic Model for Robust Topic Inference Using Word Embeddings. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. Berlin, Germany.
Kondrak, G. (2005). N-gram Similarity and Distance. Proceedings of the 12th International Conference on String Processing and Information Retrieval. Lecture Notes in Computer Science. Buenos Aires, Argentina. https://doi.org/10.1007/11575832_13
Landauer, T. K., Foltz, P. W., & Laham, D. (1998). An Introduction to Latent Semantic Analysis. Discourse Processes, 25(2-3), 259-284.
Lebret, R., & Collobert, R. (2013). Word Embeddings through Hellinger PCA. arXiv preprint arXiv:1312.5542.
Levy, O., & Goldberg, Y. (2014). Neural Word Embedding as Implicit Matrix Factorization. Advances in Neural Information Processing Systems, 27. Montreal, Canada.
Luong, M. T., Socher, R., & Manning, C. D. (2013). Better Word Representations with Recursive Neural Networks for Morphology. Proceedings of the 17th Conference on Computational Natural Language Learning. Sofia, Bulgaria.
McCormick, C., & Ryan, N. (2019, May 14). BERT Word Embeddings Tutorial. Retrieved from http://mccormickml.com/2019/05/14/BERT-word-embeddings-tutorial/
McIntosh, A. (1961). Patterns and Ranges. Language, 37(3), 325-337. https://doi.org/10.2307/411075
Miao, Y., Yu, L., & Blunsom, P. (2016). Neural Variational Inference for Text Processing. Proceedings of the 33rd International Conference on Machine Learning, 48, 1727-1736. New York, United States. https://proceedings.mlr.press/v48/miao16.html
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781. https://doi.org/10.48550/arXiv.1301.3781
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed Representations of Words and Phrases and Their Compositionality. Advances in Neural Information Processing Systems, 26, 1-9. http://doi.org/10.48550/arXiv.1310.4546
Moody, C. E. (2016). Mixing Dirichlet Topic Models and Word Embeddings to Make lda2vec. arXiv preprint. https://doi.org/10.48550/arXiv.1605.02019
Niwattanakul, S., Singthongchai, J., Naenudorn, E., & Wanapu, S. (2013). Using of Jaccard Coefficient for Keywords Similarity. Proceedings of the International MultiConference of Engineers and Computer Scientists. Hong Kong.
Pecina, P. (2005). An Extensive Empirical Study of Collocation Extraction Methods. Proceedings of the Association for Computational Linguistics Student Research Workshop. Ann Arbor, Michigan, United States.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, É. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825-2830.
Rehurek, R., & Sojka, P. (2010). Software Framework for Topic Modelling with Large Corpora. Proceedings of the 7th International Conference on Language Resources and Evaluation. Msida, Malta.
Sievert, C., & Shirley, K. (2014). LDAvis: A Method for Visualizing and Interpreting Topics. Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces. Maryland, United States.
Srivastava, A., & Sutton, C. (2017). Autoencoding Variational Inference for Topic Models. Proceedings of the International Conference on Learning Representations. Toulon, France. https://doi.org/10.48550/arXiv.1703.01488
Sung, Y. T., Chang, T. H., Lin, W. C., Hsieh, K. S., & Chang, K. E. (2016). CRIE: An Automated Analyzer for Chinese Texts. Behavior Research Methods, 48(4), 1238-1251.
Tseng, H. C., Chen, B., Chang, T. H., & Sung, Y. T. (2019). Integrating LSA-based Hierarchical Conceptual Space and Machine Learning Methods for Leveling the Readability of Domain-specific Texts. Natural Language Engineering, 25(3), 331-361.
Wang, L., Huang, J., Huang, K., Hu, Z., Wang, G., & Gu, Q. (2020). Improving Neural Language Generation With Spectrum Control. Proceedings of the International Conference on Learning Representations. Addis Ababa, Ethiopia. https://openreview.net/forum?id=ByxY8CNtvr
Wang, Y., Cui, L., & Zhang, Y. (2021). Improving Skip-gram Embeddings Using BERT. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29, 1318-1328. http://doi.org/10.1109/TASLP.2021.3065201
Yan, X., Guo, J., Lan, Y., & Cheng, X. (2013). A Biterm Topic Model for Short Texts. Proceedings of the 22nd International Conference on World Wide Web. Rio de Janeiro, Brazil. https://doi.org/10.1145/2488388.2488514
Yih, W. T., & Qazvinian, V. (2012). Measuring Word Relatedness Using Heterogeneous Vector Space Models. Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 616-620. Montreal, Canada. https://dl.acm.org/doi/abs/10.5555/2382029.2382130
Yin, J., & Wang, J. (2014). A Dirichlet Multinomial Mixture Model-based Approach for Short Text Clustering. Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, United States. https://doi.org/10.1145/2623330.2623715
Chinese References
王業奇(2016)。略論來華預科生科技漢語詞彙教學策略。華文教學與研究,62(2),53-59。
代睿(2011)。理工科學歷留學生漢語預備教育存在的問題與改進措施。高教論壇,2,85-87。
代睿(2012)。留學生科技漢語課的詞彙教學。文教資料,28,169-170。
代睿(2015)。留學生科技漢語教材編寫模式探析。現代語文(語言研究版),2,95-97。
代睿(2016)。留學生預科階段基礎科技漢語教材的選材原則。現代語文(語言研究版),12,104-107。
代睿(2021)。來華留學生科技漢語教材建設探索。教書育人(高教論壇),12,74-76。
余同瑞、金冉、韓曉臻、李家輝、郁婷(2020)。自然語言處理預訓練模型的研究綜述。計算機工程與應用,23,12-22。
李詩敏、白明弘、吳鑑城、黃淑齡、林慶隆(2016)。中文近義詞的偵測與辨別。第二十八屆自然語言與語音處理研討會,臺南,臺灣,342-351。https://aclanthology.org/O16-1030.pdf
宋曜廷、陳茹玲、李宜憲、查日龢、曾厚強、林維駿、張道行、張國恩(2013)。中文文本可讀性探討:指標選取、模型建立與效度驗證。中華心理學刊,55(1),75-106。https://doi.org/10.6129/cjp.20120621
李舟軍、范宇、吳賢杰(2020)。面向自然語言處理的預訓練技術研究綜述。計算機科學,47(3),162-173。
杜厚文(1981)。漢語科技文體的語言特點。語言教學與研究,2,87-160。
杜厚文(1986)。關於外國留學生科技漢語教學體制和教材問題。語言教學與研究,4,36-43。
杜厚文(1993)。《普通漢語教程》和《科技漢語教程》的編寫原則與設計方法。世界漢語教學,2,128-133。
周練(2015)。Word2vec的工作原理及應用探究。科技情報開發與經濟,2,145-148。
尚斌、戴莉、朱晶松(2017)。理工科院校留學生科技漢語課程教學環節研究。衛生職業教育,3,44-46。
孫旭東、戴衛平(2017)。科技詞彙的基本特點探討。中國科技術語,1,15-20。
孫雁雁、司書景(2013)。「科技漢語」教學目標設定及可行性分析——以第一學期課堂練習追蹤研究為例。北京郵電大學學報(社會科學版),2,101-105。
秦武(2001)。淺談科技漢語及教學問題。語言與翻譯,4,62-64。
張仁武(1989)。科技漢語及教學。喀什師範學院學報,5,78-83。
張桂賓(2011)。理工專業來華留學生預科教育的實踐與構想。高校教育管理,5,70-73。
張瑩(2014)。近30年科技漢語教材編寫情況的回顧與思考。出版發行研究,11,66-68。
張黎、張曄、高一瑄(2016)。專門用途漢語教學。北京語言大學出版社。
許涓(2014)。來華留學生預科教育院校(理工類)師資隊伍評估方案研究。國際漢語教學研究,3,64-71。
郭德蔭(1986)。科技漢語詞彙的特點。語言教學與研究,2,127-136。
陳明蕾、王學誠、柯華葳(2009)。中文語意空間建置及心理效度驗證:以潛在語意分析技術為基礎。中華心理學刊,51(4),415-435。
單韻鳴(2008)。專門用途漢語教材的編寫問題——以《科技漢語閱讀教程》系列教材為例。暨南大學華文學院學報,2,31-37。
單韻鳴、安然(2009)。專門用途漢語課程設置探析——以《科技漢語》課程為例。西南民族大學學報(人文社科版),8,258-263。
彭妮絲(2017)。專業華語的教與學。華文世界,119,72-78。
黃居仁、陳克健、陳鳳儀、魏文真、張麗麗(1997)。資訊用中文分詞規範設計理念及規範內容。語言文字應用學刊,1,92-100。
黃佳佳、李鵬偉、彭敏、謝倩倩、徐超(2020)。基於深度學習的主題模型研究。計算機學報,43(5),827-855。
廖宜瑤、陳鈺茹(2016)。科技華語概論。彭妮絲,專業華語概論(183-208頁)。新學林。
韓志剛、董杰(2010)。科技漢語教材編寫中的選詞問題。文教資料,26,51-53。
韓亞楠、劉建偉、羅雄麟(2021)。概率主題模型綜述。計算機學報,44(6),1095-1139。
顧雯、王娟(2020)。人工智能在華語教學中的應用。軟件導刊,19(6),39-43。
Online Resources
中文詞彙特性素描系統 (Chinese Word Sketch). Retrieved from https://wordsketch.ling.sinica.edu.tw/
中央研究院現代漢語平衡語料庫4.0 (Academia Sinica Balanced Corpus of Modern Chinese 4.0). Retrieved from http://asbc.iis.sinica.edu.tw/
馬偉雲、王欣陽、薛祐婷、范植昇、楊慕、陳紀嫣 (2016). 中文向量表達 (CKIP Chinese word embeddings). Retrieved from https://ckip.iis.sinica.edu.tw/project/embedding
國教院索引典系統 (NAER corpus concordancer). Retrieved from http://coct.naer.edu.tw/cqpweb/
曾元顯 (2012, October). 自然語言處理 (Natural language processing). In 圖書館學與資訊科學大辭典. Retrieved from http://terms.naer.edu.tw/detail/1678997/
教育部統計處 (Department of Statistics, Ministry of Education) (2017, September). 大專校院學科標準分類(第5次修正). Retrieved from https://stats.moe.gov.tw/bcode/
教育部統計處 (2021, January 29). 109學年度大專校院正式修讀學位之外國學生及其畢業生人數. Retrieved from https://stats.moe.gov.tw/files/detail/109
教育部統計處 (2022, January 28). 110學年度大專校院正式修讀學位之外國學生及其畢業生人數. Retrieved from https://stats.moe.gov.tw/files/detail/110
楊慕、馬偉雲 (2020, November 18). CKIP Transformers. Retrieved from https://github.com/ckiplab/ckip-transformers