Graduate student: | 陳柏瑋 Chen, Po-Wei |
---|---|
Thesis title: | 利用機器學習填補遺漏值的比較與研究 Comparison of multiple machine-learning methods of imputation |
Advisor: | 呂翠珊 Lu, Tsui-Shan |
Oral defense committee: | 蔡碧紋 Tsai, Pi-Wen; 吳宗軒 Wu, Chung-Hsuen; 呂翠珊 Lu, Tsui-Shan |
Oral defense date: | 2022/06/23 |
Degree: | Master |
Department: | 數學系 Department of Mathematics |
Year of publication: | 2022 |
Academic year of graduation: | 110 |
Language: | English |
Number of pages: | 33 |
Keywords (translated from Chinese): | Missing values, machine learning, K-Nearest Neighbor, Multivariate Imputation by Chained Equations, MissForest |
Keywords (English): | Imputation of missing values, K-Nearest Neighbor, Multivariate Imputation by Chained Equations, MissForest |
DOI URL: | http://doi.org/10.6345/NTNU202201080 |
Thesis type: | Academic thesis |
This study compares several machine-learning methods for imputing missing values. Imputation is an important step in data analysis: if missing values are arbitrarily deleted or replaced by a simple substitute, the subsequent statistical analysis may be substantially biased. Choosing effectively among the available imputation methods is therefore crucial.
We consider three currently popular machine-learning imputation methods: K-Nearest Neighbor, Multivariate Imputation by Chained Equations, and MissForest. We conduct simulation studies on all-continuous, all-categorical, and mixed-type data sets under various random-missingness settings and evaluate the results of each method. The results show that MissForest performs best in terms of both the normalized root mean squared error (NRMSE) and the proportion of falsely classified entries (PFC). We also apply the three methods to several real data sets, where MissForest likewise outperforms the other two methods.
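The two evaluation metrics named in the abstract can be illustrated with a minimal sketch. It assumes the definitions commonly used with MissForest: NRMSE is the square root of the mean squared imputation error divided by the variance of the true values, and PFC is the share of categorical entries imputed with the wrong label, both computed only over the entries that were missing. The function names, the use of the population variance, and the sample values are assumptions made for illustration and are not taken from the thesis.

```python
import math
from statistics import pvariance


def nrmse(true_values, imputed_values):
    """Normalized RMSE over the originally missing continuous entries."""
    if len(true_values) != len(imputed_values) or not true_values:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - i) ** 2 for t, i in zip(true_values, imputed_values)]
    mse = sum(squared_errors) / len(squared_errors)
    # Assumption: population variance of the true values is the normalizer;
    # some implementations use the sample variance instead.
    return math.sqrt(mse / pvariance(true_values))


def pfc(true_labels, imputed_labels):
    """Proportion of falsely classified entries among the missing categorical ones."""
    if len(true_labels) != len(imputed_labels) or not true_labels:
        raise ValueError("inputs must be non-empty and of equal length")
    wrong = sum(t != i for t, i in zip(true_labels, imputed_labels))
    return wrong / len(true_labels)


# Hypothetical values at the positions that were masked as missing.
true_continuous = [1.0, 2.0, 3.0, 4.0]
mean_filled = [2.5, 2.5, 2.5, 2.5]  # every gap filled with the mean
print(nrmse(true_continuous, mean_filled))  # 1.0

true_categories = ["a", "b", "a", "c"]
imputed_categories = ["a", "b", "c", "c"]  # one of four labels is wrong
print(pfc(true_categories, imputed_categories))  # 0.25
```

Under these definitions both metrics are 0 for a perfect imputation, and lower values indicate better performance. With the population variance as normalizer, filling every gap with the mean of the true values gives an NRMSE of 1, which makes that value a convenient baseline.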