
Author: Weng, Ting-I (翁廷毅)
Thesis title: Multi-Strategy Incremental Training for Unbalanced Document Classification (類別不平衡下文件分類之多策略漸進訓練)
Advisor: Koh, Jia-Ling (柯佳伶)
Oral defense committee: Koh, Jia-Ling (柯佳伶); Wu, Yi-Hung (吳宜鴻); Shan, Man-Kwan (沈錳坤)
Oral defense date: 2024/10/18
Degree: Master
Department: Department of Computer Science and Information Engineering
Year of publication: 2024
Academic year of graduation: 113
Language: Chinese
Number of pages: 52
Keywords (Chinese): 類別原型學習、心理健康素養
Keywords (English): Prototype Learning, Mental Health Literacy
Research method: Experimental design
DOI URL: http://doi.org/10.6345/NTNU202401981
Thesis type: Academic thesis
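  • The widespread use of social media has produced a large volume of user-generated content, a valuable source for analyzing public attitudes toward specific topics. However, posts in online forums cover a very broad range of content, and the vast majority are unrelated to the target topic; such content can be assigned to class 0. Compared with the fine-grained topic-related classes, class 0 makes the data distribution across classes extremely imbalanced. Building a text classifier that can simultaneously identify the topic-unrelated class and each topic-related class is therefore a major challenge. To address this problem, this study proposes a semi-supervised learning method named Multi-Strategy Incremental Training (MSIT). MSIT integrates the concept of class prototype learning into its training strategies. It first builds a base model through an initial training stage and an enhanced training stage, adopting the R-Drop training strategy and a prototype separation training strategy to improve the model's representation learning. MSIT then uses the base model to pseudo-label unlabeled data and, based on the similarity between each data representation and the class prototypes, selects a subset of the pseudo-labeled data to add to training. During self-training, MSIT also considers the consistency between the model's predictions and the class-prototype predictions to further optimize the classifier's parameters. On a mental health literacy attitude classification task, experiments on document data collected from social media platforms show that MSIT's two-stage base training strategy significantly improves the base model's performance: its F1 scores on the topic-related classes exceed those of baseline models built with methods from related work. After unlabeled data is further introduced for semi-supervised learning, MSIT's prediction performance improves further, reaching a best macro-F1 of 0.663, and it shows stable predictive ability across multiple datasets, with overall performance better than that of models from related work.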
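
As a rough illustration of the pseudo-label selection step described above, the following Python sketch builds one prototype per class and keeps only the pseudo-labeled items whose representation lies close to the prototype of their predicted class. The abstract does not state the exact criterion, so the mean-vector prototypes, the cosine similarity measure, the `threshold` value, and all function names are assumptions made for illustration, not the thesis's implementation.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_prototypes(labeled):
    """labeled: list of (vector, label) pairs.

    Assumption: each class prototype is the mean of that class's vectors.
    """
    by_class = {}
    for vec, label in labeled:
        by_class.setdefault(label, []).append(vec)
    return {label: mean_vector(vecs) for label, vecs in by_class.items()}


def select_pseudo_labeled(unlabeled, pseudo_labels, prototypes, threshold=0.9):
    """Keep (vector, pseudo_label) pairs whose vector is similar enough
    to the prototype of the pseudo-labeled class (hypothetical criterion)."""
    selected = []
    for vec, label in zip(unlabeled, pseudo_labels):
        if cosine(vec, prototypes[label]) >= threshold:
            selected.append((vec, label))
    return selected
```

In this sketch, an item such as `[0.5, 0.5]` that sits between two prototypes falls below the threshold and is left out of training, while items near a prototype are kept with their pseudo-labels.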

    The widespread use of social media has generated a vast amount of user-generated content, providing valuable data for analyzing users' attitudes toward specific topics. However, forum posts are highly diverse, and most are unrelated to the target topic; such posts are labeled "irrelevant" and assigned to category 0. When category 0 is combined with the topic-relevant categories to be detected, the data distribution across categories becomes extremely imbalanced, which makes it difficult to build a text classifier that identifies both the unrelated category and the topic-related categories. To mitigate this imbalance and improve classification performance, this study proposes a semi-supervised learning approach named Multi-Strategy Incremental Training (MSIT), which incorporates prototype learning into the training process. The base model is first built in two stages, an initial training phase and an enhanced training phase, using two training strategies, R-Drop and prototype separation, to improve representation learning. MSIT then uses the base model to assign pseudo-labels to unlabeled data and selects a subset of the pseudo-labeled data for training based on the similarity of their representations to the class prototypes. During self-training, the consistency between model predictions and prototype predictions is also considered to further refine the classifier's parameters and enhance its generalization ability. The approach was evaluated on documents collected from social media platforms, on the task of classifying attitudes related to mental health literacy. Experimental results show that MSIT's two-stage training effectively improves the base model: the resulting classifier achieves higher F1 scores on the topic-related categories than models proposed in related work. Incorporating unlabeled data through semi-supervised learning improves performance further, reaching a best macro-F1 score of 0.663. MSIT also performs consistently across the datasets evaluated, outperforming related work overall.
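
To make the reported metric concrete, the sketch below computes macro-F1 as the unweighted mean of per-class F1 scores, which is the standard definition. The toy labels and the function names are hypothetical and are not taken from the thesis; the example only shows why macro-F1 is informative when category 0 dominates the data.

```python
def f1_for_class(y_true, y_pred, label):
    """F1 score for one class, treating that class as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0  # no true positives: F1 is defined as 0 here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes that appear."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in labels) / len(labels)


def accuracy(y_true, y_pred):
    """Fraction of items whose predicted label matches the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Hypothetical imbalanced sample: eight items of category 0, one each of 1 and 2.
y_true = [0] * 8 + [1, 2]
# A degenerate classifier that always predicts the majority category.
y_pred = [0] * 10

print(accuracy(y_true, y_pred))
print(macro_f1(y_true, y_pred))
```

Here the always-predict-0 classifier is correct on 8 of 10 items, yet its macro-F1 is only 8/27 because the two minority categories each contribute an F1 of 0.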

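    Abstract (Chinese)
    Abstract (English)
    Table of Contents
    List of Figures
    List of Tables
    Chapter 1 Introduction
    1.1 Research Motivation and Objectives
    1.2 Research Method
    1.3 Thesis Organization
    Chapter 2 Literature Review
    2.1 Evolution of Text Classification Techniques
    2.1.1 Text Classification Methods Without Deep Learning
    2.1.2 Text Classification Methods Using Deep Learning
    2.2 Methods for Increasing Training Data
    2.2.1 Instance-Level Data Augmentation Techniques
    2.2.2 Data Representation Augmentation Techniques
    2.2.3 Self-Training with Pseudo-Labeled Data
    2.3 Model Training Strategies
    2.3.1 Multi-Representation Prediction Consistency Training Strategy
    2.3.2 Learning-Order Adjustment Training Strategy
    2.3.3 Class-Prototype-Assisted Classifier Training
    Chapter 3 Problem Definition and Model Construction
    3.1 Problem Definition
    3.2 Model Architecture and System Construction Process
    3.2.1 Model Architecture
    3.2.2 System Construction Process
    3.3 Initial Training of the Base Model
    3.3.1 Building Training Batches
    3.3.2 Model Training Strategies
    3.4 Enhanced Training of the Base Model
    3.5 Semi-Supervised Self-Training Adjustment
    3.5.1 Building Self-Training Batches
    3.5.2 Selection of Pseudo-Labeled Data for Training
    3.5.3 Class-Prototype Prediction Consistency Training Strategy
    Chapter 4 Experimental Evaluation and Discussion
    4.1 Description of Experimental Datasets
    4.2 Experimental Parameters and Environment Settings
    4.3 Evaluation Metrics and Baseline Models
    4.3.1 Evaluation Metrics
    4.4 Baseline Models
    4.5 Comparison of Prediction Performance Across Models
    Chapter 5 Conclusion and Future Research Directions
    References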

