Graduate student: |
廖邵瑋 Liao, Shao-Wei |
Thesis title: |
自然語言處理技術應用於科技華語詞彙分析 A Vocabulary Analysis of Chinese for Science and Technology with Natural Language Processing Methods |
Advisor: |
Hong, Jia-Fei |
Oral defense committee: |
Hsieh, Chia-Ling; 曾厚強 Tseng, Hou-Chiang; 洪嘉馡 Hong, Jia-Fei |
Oral defense date: | 2022/06/23 |
Degree: |
Master's |
Department: |
華語文教學系 Department of Chinese as a Second Language |
Year of publication: | 2022 |
Academic year of graduation: | 110 |
Language: | Chinese |
Number of pages: | 172 |
Chinese keywords: | 科技華語, 自然語言處理, LDA, Word2Vec |
English keywords: | Chinese for Science and Technology, Natural Language Processing, LDA, Word2Vec |
Research method: | Content analysis |
DOI URL: | http://doi.org/10.6345/NTNU202200683 |
Thesis type: | Academic thesis |
This study trains an LDA topic model and a Word2Vec word-vector model on 11,067 texts from PanSci to support the teaching of Chinese for science and technology (CST) vocabulary.
As internationalization and technological development reinforce each other, a growing number of foreign learners come to Taiwan to study science-related subjects. To meet their needs in professional courses and in academic communication with their peers, the CST curriculum must bridge the gap between Chinese for general purposes (CGP) and academic courses. However, language-related research in this specialized area is scarce, which leaves CST curricula and materials with two major problems: appropriate vocabulary cannot be selected for learners in different disciplines, and vocabulary usage in scientific texts has not been analyzed. This study therefore pursues three objectives: first, to delimit the range of CST words in different subject areas and propose a reference wordlist; second, to analyze how the co-occurrence of CST words differs between CGP and CST contexts; and third, to compare the usage contexts and co-occurrence of CST synonyms.
First, the LDA topic model yielded nine latent topics in the science texts: "food science, nutrition", "biology, life science", "medicine, pharmacy, public health", "academic life", "information and communication technology, electrical and electronic engineering", "earth science, environmental science", "astronomy, aerospace engineering", "physics, chemistry, material science", and "neuropsychology, statistics". Next, the vocabulary associated with each topic was graded for difficulty with the National Academy for Educational Research's word grading system, and a list of recommended words was compiled for each CST field. Finally, taking words from these lists as examples, we used the Word2Vec model to compute semantic similarity, compared how CST words are used in CGP and CST contexts, and analyzed CST synonyms, so that the results can serve as a reference for teaching CST vocabulary.
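The similarity step described above can be illustrated with a minimal sketch. It assumes that semantic similarity between word vectors is measured by cosine similarity, which is the usual choice for Word2Vec but is not stated in the abstract. The three-dimensional vectors, the example words, and the helper names `cosine_similarity` and `most_similar` are all invented for illustration; they are not taken from the thesis or from any trained model.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=3):
    """Rank every other word in `vectors` by cosine similarity to `word`."""
    target = vectors[word]
    scored = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Hypothetical toy embeddings (assumption: real Word2Vec vectors would have
# hundreds of dimensions and would be learned from the corpus).
cst_vectors = {
    "細胞": [0.9, 0.1, 0.0],
    "基因": [0.8, 0.2, 0.1],
    "電池": [0.1, 0.9, 0.2],
    "電路": [0.0, 0.8, 0.3],
}

if __name__ == "__main__":
    for neighbour, score in most_similar("細胞", cst_vectors):
        print(neighbour, round(score, 3))
```

Running the same ranking on vectors trained separately on general-purpose and science texts would be one way to compare a word's neighbours across the CGP and CST contexts.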