| Field | Value |
|---|---|
| Author | 黃冠綸 Huang, Guan-Lun |
| Title | 中文期刊論文資訊擷取之研究 — 以圖書資訊學領域為例 (Information Extraction From Chinese Scientific Article — A Case Study of Library and Information Science) |
| Advisor | 曾元顯 Tseng, Yuen-Hsien |
| Oral defense committee | 柯皓仁 Ke, Hao-Ren; 李龍豪 Lee, Lung-Hao; 曾元顯 Tseng, Yuen-Hsien |
| Oral defense date | 2023/01/12 |
| Degree | Master (碩士) |
| Department | Graduate Institute of Library and Information Studies (圖書資訊學研究所) |
| Year of publication | 2023 |
| Academic year of graduation | 111 (ROC calendar) |
| Language | Chinese |
| Number of pages | 63 |
| Chinese keywords | 資訊擷取、開放原始碼、GROBID、全文資料集 |
| English keywords | Information Extraction, Open Source, GROBID, Full Text Dataset |
| Research method | System development |
| DOI URL | http://doi.org/10.6345/NTNU202301549 |
| Thesis type | Academic thesis |
| Usage count | Views: 89, Downloads: 16 |
The volume of scientific literature is growing at an astonishing rate, and extracting the massive amount of knowledge-rich content locked inside scientific article PDFs has become a pressing issue. Research on this problem, however, remains scarce in Taiwan. This study proposes a solution for extracting information from Chinese-language academic journal articles published in Taiwan, using the field of library and information science as a case study.
By retraining the open-source scientific-document parsing tool GROBID, the study extracts bibliographic and structural information from Chinese journal articles, including titles, authors, abstracts, keywords, and full text organized into logical sections. Training effectiveness is evaluated with ten-fold cross-validation. The retrained models are then used to parse 725 journal articles from Taiwan's library and information science field, and the factors that may affect the parsing success rate are observed and analyzed.
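To make the parsing step concrete, here is a minimal sketch, assuming a GROBID service with the retrained models is already running locally on the default port 8070, of sending one PDF to the service and reading the title, abstract, and section headings back out of the returned TEI XML. The file name and the helper function are illustrative; this is not the thesis's actual pipeline.

```python
import requests
from lxml import etree

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
# Assumed: a local GROBID service on the default port, loaded with the retrained models.
GROBID_URL = "http://localhost:8070/api/processFulltextDocument"

def parse_article(pdf_path: str) -> dict:
    """Send one PDF to GROBID and return its title, abstract, and section headings."""
    with open(pdf_path, "rb") as f:
        resp = requests.post(GROBID_URL, files={"input": f}, timeout=120)
    resp.raise_for_status()
    tei = etree.fromstring(resp.content)

    title = tei.findtext(".//tei:titleStmt/tei:title", namespaces=TEI_NS)
    abstract = " ".join(tei.xpath(".//tei:abstract//text()", namespaces=TEI_NS)).strip()
    # Each <div> in the TEI body is one logical section; its <head> holds the heading.
    headings = [h.text for h in tei.findall(".//tei:body/tei:div/tei:head", namespaces=TEI_NS) if h.text]
    return {"title": title, "abstract": abstract, "sections": headings}

if __name__ == "__main__":
    print(parse_article("example_article.pdf"))  # hypothetical input file
```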
The study finds that the F1 scores of the three retrained models (Segmentation, Header, Fulltext) do not improve noticeably when the training set grows from n = 100 to n = 250 annotated samples. In addition, articles from the same journal can use different layouts depending on the year of publication, and this layout variation affects the parsing success rate.
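The comparison above rests on averaging F1 scores over the ten cross-validation folds. As a generic illustration of that aggregation (not GROBID's own evaluation code), the sketch below computes precision, recall, and F1 from per-fold counts for two training sizes; all numbers in it are made up.

```python
from dataclasses import dataclass

@dataclass
class FoldCounts:
    tp: int  # correctly extracted fields or tokens
    fp: int  # spurious extractions
    fn: int  # missed fields or tokens

def f1(c: FoldCounts) -> float:
    """Standard F1 from true-positive, false-positive, and false-negative counts."""
    precision = c.tp / (c.tp + c.fp) if c.tp + c.fp else 0.0
    recall = c.tp / (c.tp + c.fn) if c.tp + c.fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def mean_f1(folds: list[FoldCounts]) -> float:
    """Macro-average F1 over the cross-validation folds."""
    return sum(f1(c) for c in folds) / len(folds)

# Hypothetical per-fold counts for n = 100 and n = 250; real values come from the evaluation.
folds_n100 = [FoldCounts(90, 12, 10) for _ in range(10)]
folds_n250 = [FoldCounts(92, 11, 9) for _ in range(10)]
print(mean_f1(folds_n100), mean_f1(folds_n250))
```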
Finally, the parsed full text is loaded into a question-answering (QA) system so that the system can answer more specialized, domain-specific questions, demonstrating one way the parsed scientific literature can be put to further use.
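The abstract does not describe the QA system's interface, so the following is only a sketch of the ingestion step under that caveat: it splits the GROBID TEI body into (section heading, text) passages and picks the passage most similar to a question by character-bigram overlap, which a downstream Chinese QA model could then use as reading context. The function names and the retrieval heuristic are assumptions for illustration.

```python
from lxml import etree

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def tei_to_passages(tei_xml: bytes) -> list[tuple[str, str]]:
    """Turn each <div> of the TEI body into a (section heading, section text) passage."""
    tei = etree.fromstring(tei_xml)
    passages = []
    for div in tei.findall(".//tei:body/tei:div", namespaces=TEI_NS):
        head = div.findtext("tei:head", default="", namespaces=TEI_NS)
        text = "".join(div.xpath(".//tei:p//text()", namespaces=TEI_NS))
        if text.strip():
            passages.append((head, text.strip()))
    return passages

def bigrams(s: str) -> set[str]:
    return {s[i:i + 2] for i in range(len(s) - 1)}

def best_passage(question: str, passages: list[tuple[str, str]]) -> tuple[str, str]:
    """Pick the passage whose character bigrams overlap most with the question (a simple retriever)."""
    q = bigrams(question)
    return max(passages, key=lambda p: len(q & bigrams(p[0] + p[1])) / (len(q) or 1))

# Usage (hypothetical): feed the best-matching section to a QA model as its reading context.
# tei_xml = open("article.tei.xml", "rb").read()          # output of the GROBID parsing step
# head, context = best_passage("GROBID 使用哪些模型?", tei_to_passages(tei_xml))
```

Character-bigram overlap is only a stand-in for whatever retrieval the actual QA system performs; the point is that section-level passages from the parsed TEI, rather than raw PDF text, are what make more specialized questions answerable.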