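Graduate student: 洪坊瑜 (Hong, Fang-Yu)
Thesis title: 利用隨機交互森林預測模型之應用 (Applications of Predictive Models Using Random Interaction Forests)
Advisors: 程毅豪 (Chen, Yi-Hau); 呂翠珊 (Lu, Tsui-Shan)
Oral defense committee: 林惠文 (Lin, Hui-Wen); 程毅豪 (Chen, Yi-Hau); 呂翠珊 (Lu, Tsui-Shan)
Oral defense date: 2023/06/20
Degree: Master
Department: Department of Mathematics
Year of publication: 2023
Academic year of graduation: 111 (ROC calendar, i.e. the 2022-2023 academic year)
Language: Chinese
Number of pages: 43
Chinese keywords: 交互作用, 隨機森林, 隨機交互森林, 機器學習, 迴歸分析
English keywords: interaction effect, random forests, random interaction forests, machine learning, regression analysis
Research methods: experimental design, secondary data analysis, survey research, thematic analysis, comparative research, observational research, content analysis
DOI URL: http://doi.org/10.6345/NTNU202300737
Thesis type: Academic thesis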
  • Predictive analysis draws on biological, industrial, and commercial data in many fields, for example customer behavior, consumer demand, stock price fluctuations, and patient diagnosis. Exploring the interactions among important variables can make such models more accurate. This thesis applies the random forest algorithm and accounts for interaction effects, both to improve the model and to give insight into how the explanatory variables interact. The random interaction forest (RIF) is a new algorithmic strategy derived from the random forest (RF). It is suitable for categorical, continuous, and survival outcomes, and it explicitly models the qualitative and quantitative interactions between the variables used by the decision trees that make up the forest.
    In the simulation study, the R package "vivid" (Variable Importance and Variable Interactions Displays) was used to visualize variable importance and variable interactions in machine learning models. The R package "diversityForest" was also used; it uses split sampling by vote to carry out complex classification procedures in random forests, and it models quantitative and qualitative interaction effects using bivariate splits.
    The interaction forest (IF) comes with an effect importance measure (EIM), which can be used to identify variables involved in quantitative and qualitative interactions with high predictive relevance. IF and EIM focus on easily interpretable forms of interaction. Using the new random interaction forest structure, linear regression and logistic regression models were examined, and the capability of machine learning prediction models was extended. The simulation results show that, when interactions are present, RIF outperforms the random forest as well as logistic and linear regression, and in many situations it gives more accurate results than models built with traditional statistical methods. The more significant the interaction, the greater the advantage of RIF, which suggests that the method can improve the efficiency of business processes and scientific research. Moreover, RIF's discrimination in predictive modeling and its use of interaction effects are relatively easy to interpret. We believe it is a challenging yet suitable tool. Finally, these methods are evaluated on real data on the number of deaths in Taipei City from 2012 to 2016.
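To make the notion of an interaction effect concrete, the following is a minimal sketch. It is not the thesis's code and does not implement RIF. It simulates a continuous predictor `x1` and a binary predictor `x2` whose interaction is qualitative, meaning the sign of the `x1` effect flips with `x2`. It then compares an ordinary least-squares fit with main effects only against one that adds the product term `x1 * x2`. The coefficients, sample size, noise level, and the helper `fit_r2` are all illustrative assumptions.

```python
import numpy as np


def fit_r2(X, y):
    """Fit ordinary least squares with an intercept; return training R^2."""
    design = np.column_stack([np.ones(len(X)), X])  # prepend intercept column
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return 1.0 - resid.var() / y.var()


rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n = 2000

x1 = rng.normal(size=n)                        # continuous predictor
x2 = rng.integers(0, 2, size=n).astype(float)  # binary predictor (0 or 1)

# Assumed data-generating model (hypothetical coefficients):
# the slope of x1 is -1.5 when x2 == 0 and +1.5 when x2 == 1.
y = 1.0 + 1.5 * x1 * (2.0 * x2 - 1.0) + rng.normal(scale=0.5, size=n)

# Model A: main effects only.
r2_main = fit_r2(np.column_stack([x1, x2]), y)

# Model B: main effects plus the x1 * x2 interaction term.
r2_inter = fit_r2(np.column_stack([x1, x2, x1 * x2]), y)

print(f"R^2, main effects only: {r2_main:.3f}")
print(f"R^2, with interaction:  {r2_inter:.3f}")
```

Because the two slopes cancel on average, the main-effects model explains very little of the variance, while the model with the product term recovers most of it. This is the kind of structure that interaction-aware methods are meant to capture without the analyst specifying the product term in advance.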

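    Acknowledgments
    Chinese Abstract
    Abstract
    Table of Contents
    List of Tables
    List of Figures
    Chapter 1 Introduction
        Section 1 Principles of Interaction
            Item 1 General Form of the Linear Model with Interaction
            Item 2 Interaction between a Continuous Variable and a Binary Variable
            Item 3 Interaction between Two Binary Variables
            Item 4 Interaction between Two Continuous Variables
            Item 5 Analysis of Main Effects and Interaction Effects
        Section 2 Random Forest Theory
            Item 1 Discriminant Function
        Section 3 Research Motivation and Discussion
    Chapter 2 Research Methods
        Section 1 Machine Learning Methods
            Item 1 Bagging
            Item 2 Decision Tree
            Item 3 Support Vector Machine (SVM)
            Item 4 K-Nearest Neighbors (KNN)
            Item 5 Extreme Gradient Boosting Machine with Feature Interaction (XGB-FI)
        Section 2 R Packages for Selecting Important Variables
            Item 1 vivid package
            Item 2 diversityForest package
        Section 3 Random Interaction Forest
    Chapter 3 Data Analysis
        Section 1 Data Background
        Section 2 Data Mining
        Section 3 Research Objectives and Applications
    Chapter 4 Results and Discussion
    Chapter 5 Conclusions and Future Development
    References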

    Ho, Tin Kam. “Random Decision Forests.” Proc. of the 3rd Int'l Conf. on Document Analysis and Recognition, Montreal, Canada, August 14-18, 278-282, 1995.
    Ho, Tin Kam. “The Random Subspace Method for Constructing Decision Forests.” IEEE Trans. Pattern Anal. Mach. Intell. 20: 832-844, 1998.
    Breiman, L. “Random Forests.” Machine Learning 45: 5-32, 2001.
    Berlind, Roger Steven. “An alternative method of stochastic discrimination with applications to pattern recognition.”: 4878-4878, 1995.
    Cutler, Adele and Guo hua Zhao. “PERT – perfect random tree ensembles.” Computing Science and Statistics 33.4:90-4, 2001.
    Ho, Tin Kam. “Recognition of handwritten digits by combining independent learning vector quantizations.” Proceedings of 2nd International Conference on Document Analysis and Recognition (ICDAR'93). IEEE, 1993.
    Cutler, D. Richard, Thomas C. Edwards, Karen H. Beard, Adele Cutler, Kyle Hess, Jacob Gibson and Joshua J. Lawler. “Random forests for classification in ecology.” Ecology 88 11: 2783-92, 2007.
    Tomasi, Carlo. “Decision Trees and Random Decision Forests.”, 2021.
    Zhen Zeng, Yuefeng Lu, Judong Shen, Wei Zheng, Peter Shaw, Mary Beth Dorr. “A random interaction forest for prioritizing predictive biomarkers.” arXiv preprint arXiv:1910.01786, 2019.
    Loh, Wei-Yin. “Classification and regression trees.” Wiley interdisciplinary reviews: data mining and knowledge discovery 1.1: 14-23, 2011.
    Guo, Chao-Yu and Yi-Jyun Lin. “Random Interaction Forest (RIF)–A Novel Machine Learning Strategy Accounting for Feature Interaction.” IEEE Access 11: 1806-1813, 2023.
    Cutler, Adele, D. Richard Cutler, and John R. Stevens. “Random forests.” Ensemble machine learning: Methods and applications:157-175, 2012.
    Hornung, Roman and Anne‐Laure Boulesteix. “Interaction forests: Identifying and exploiting interpretable quantitative and qualitative interaction effects.” Computational Statistics & Data Analysis 171: 107460, 2022.
    Inglis, Alan, Andrew Parnell, and Catherine B. Hurley. “Visualizing variable importance and variable interaction effects in machine learning models.” Journal of Computational and Graphical Statistics, 31(3), 766-778, 2022.
    McClelland, Gary H. and Charles M Judd. “Statistical difficulties of detecting interactions and moderator effects.” Psychological bulletin 114 2: 376-90, 1993.
    Zhang, Jiong and Mohammad Zulkernine. “A hybrid network intrusion detection technique using random forests.” First International Conference on Availability, Reliability and Security (ARES'06). IEEE, 2006.
    Chauhan, Vinod Kumar, Kalpana Dahiya and Anuj Kumar Sharma. “Problem formulations and solvers in linear SVM: a review.” Artificial Intelligence Review: 1-53, 2019.
    Denisko, Danielle, and Michael M. Hoffman. “Classification and interaction in random forests.” Proceedings of the National Academy of Sciences 115.8: 1690-1692, 2018.
    Benarie, Michel. “Interactions between air contaminants and forest ecosystems.” Science of the Total Environment 29.1-2: 187-188, 1983.
    Zhang, Haotong, Alexander C. Berg, Michael Maire and Jitendra Malik. “SVM-KNN: Discriminative Nearest Neighbor Classification for Visual Category Recognition.” 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06) 2: 2126-2136, 2006.
    Breiman, Leo. “Bagging predictors.” Machine learning 24: 123-140, 1996.
    Chen, Tianqi, Tong He, Michael Benesty, Vadim Khotilovich, Yuan Tang, Hyunsu Cho, Kailong Chen, Rory Mitchell, Ignacio Cano, Tianyi Zhou, Mu Li, Junyuan Xie, Min Lin, Yifeng Geng and Yutian Li. “Extreme Gradient Boosting [R package xgboost version 1.2.0.1].”, 2020.
    Althuwaynee, Omar F., Sang-Wan Kim, Mohamed A. Najemaden, Ali Aydda, Abdul-Lateef Babatunde Balogun, Moatasem M. Fayyadh and Hyuck-Jin Park. “Demystifying uncertainty in PM10 susceptibility mapping using variable drop-off in extreme-gradient boosting (XGB) and random forest (RF)algorithms.” Environmental Science and Pollution Research 28: 43544 – 43566, 2021.
    Wälder, Konrad and Olga Wälder. “Analysing interaction effects in forests using the mark correlation function.” Iforest - Biogeosciences and Forestry 1.1: 34, 2008.
    Guyon, Isabelle M and André Elisseeff. “An Introduction to Variable and Feature Selection.” J. Mach. Learn. Res. 3: 1157-1182, 2003.
    Wright, Marvin N., and Andreas Ziegler. “ranger: A fast implementation of random forests for high dimensional data in C++ and R.” arXiv preprint arXiv:1508.04409, 2015.
    Yuan, Ye, Liji Wu, and Xiangmin Zhang. “Gini-Impurity index analysis.” IEEE Transactions on Information Forensics and Security, 16, 3154-3169, 2021.
    Guo, Chao-Yu, and Ke-Hao Chang. “A Novel Algorithm to Estimate the Significance Level of a Feature Interaction Using the Extreme Gradient Boosting Machine.” International journal of environmental research and public health 19.4: 2338, 2022.
    Friedman, Jerome H., and Bogdan E. Popescu. "Predictive learning via rule ensembles." The annals of applied statistics: 916-954, 2008.
    Hastie, Trevor, Robert Tibshirani, and Jerome H. Friedman. The elements of statistical learning: data mining, inference, and prediction. 2nd ed. New York: Springer, 2009.
