
Author: Chung, Wen-Yu (鍾汶育)
Thesis Title: Improving the Predictive Power of Machine Learning Methods through Exploratory Data Analysis and Feature Engineering – Taking Credit Loan Data as an Example (透過探索性資料分析搭配特徵工程提高機器學習方法之預測力–以信用貸款資料為例)
Advisor: Ho, Tsung-Wu (何宗武)
Degree: Master
Department: Graduate Institute of Global Business and Strategy
Year of Publication: 2020
Academic Year of Graduation: 108 (ROC calendar)
Language: Chinese
Number of Pages: 47
Chinese Keywords: 異質性, 資料分群, 預測改善, 交叉驗證法
English Keywords: heterogeneity, data segmentation, prediction improvement, cross-validation
DOI URL: http://doi.org/10.6345/NTNU202001095
Thesis Type: Academic thesis
    This study uses exploratory data analysis to transform variables and create new ones (interaction terms), then applies statistical tests to split the data into segments so that the heterogeneity of the overall dataset is reduced. Models are trained separately on each segment, and their predictive performance is measured with several indicators to show that this procedure improves prediction.

    The data are the bank credit loan dataset Give Me Some Credit from Kaggle, which contains borrowers' basic personal information and personal financing information. The target variable is whether the borrower "experienced 90 days past due delinquency or worse".

    A chi-squared test shows that Debt Ratio and Number Of Open Credit Lines And Loans are not independent. To locate the boundary for splitting the data, we run t tests on Debt Ratio: among borrowers with 0-12 open credit lines and loans, Debt Ratio differs significantly between defaulters and non-defaulters, whereas among borrowers with 13-56 open credit lines and loans it does not. We therefore split the data into two groups at 12 open credit lines and loans, and train and evaluate logistic regression, decision tree, K-nearest neighbor, and random forest models with cross-validation. The evaluation indicators are Accuracy, Recall, Precision, F1 Score, and the AUC (Area Under the Curve) of the Receiver Operating Characteristic curve. We use these five indicators to assess each model on the two groups and to test whether the differences are significant.

    The results show that K-nearest neighbor improved most on the "No Missing Value" data: in Groups I, II, and III, every indicator except Recall improved by about 10%-15%, and Recall improved most in Group II, by about 7%. Comparing groups, Group I outperformed Group II by about 4%-10% for decision tree, K-nearest neighbor, and random forest, which indicates that predictive performance differs across groups after segmentation. The "Only Monthly Income Missing" and "Monthly Income and Number of Dependents Both Missing" data were not segmented, possibly because key variables were missing or there were too few records, so the improvement there was smaller than for the "No Missing Value" data.
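    The boundary check described in the abstract can be illustrated with a minimal sketch. It is not the thesis's code: the records are made-up toy values rather than the Give Me Some Credit data, the field names (`open_lines`, `debt_ratio`, `default`) and helper functions are hypothetical, and Welch's unequal-variance form of the t test is an assumption, since the abstract does not say which variant was used. The sketch splits records at 12 open credit lines and computes the t statistic for Debt Ratio between defaulters and non-defaulters in each segment.

```python
from math import sqrt
from statistics import mean, variance

BOUNDARY = 12  # boundary on open credit lines reported in the abstract


def welch_t(a, b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def split_by_open_lines(rows, boundary=BOUNDARY):
    """Split rows into (open lines <= boundary, open lines > boundary)."""
    low = [r for r in rows if r["open_lines"] <= boundary]
    high = [r for r in rows if r["open_lines"] > boundary]
    return low, high


def debt_ratio_t(rows):
    """Welch t statistic for debt ratio: defaulters vs non-defaulters."""
    bad = [r["debt_ratio"] for r in rows if r["default"] == 1]
    good = [r["debt_ratio"] for r in rows if r["default"] == 0]
    return welch_t(bad, good)


# Toy, made-up records (hypothetical values, not the Kaggle data).
rows = [
    {"open_lines": 3, "debt_ratio": 0.80, "default": 1},
    {"open_lines": 5, "debt_ratio": 0.90, "default": 1},
    {"open_lines": 9, "debt_ratio": 0.85, "default": 1},
    {"open_lines": 2, "debt_ratio": 0.30, "default": 0},
    {"open_lines": 7, "debt_ratio": 0.35, "default": 0},
    {"open_lines": 12, "debt_ratio": 0.25, "default": 0},
    {"open_lines": 13, "debt_ratio": 0.50, "default": 1},
    {"open_lines": 20, "debt_ratio": 0.40, "default": 1},
    {"open_lines": 31, "debt_ratio": 0.60, "default": 1},
    {"open_lines": 15, "debt_ratio": 0.45, "default": 0},
    {"open_lines": 22, "debt_ratio": 0.55, "default": 0},
    {"open_lines": 56, "debt_ratio": 0.50, "default": 0},
]

low, high = split_by_open_lines(rows)
t_low, df_low = debt_ratio_t(low)
t_high, df_high = debt_ratio_t(high)

print(f"0-{BOUNDARY} open lines: t = {t_low:.2f}, df = {df_low:.1f}")
print(f">{BOUNDARY} open lines: t = {t_high:.2f}, df = {df_high:.1f}")
```

    Turning each statistic into a p-value requires the t distribution's CDF, which the standard library does not provide; a statistics package would normally supply it. On real data the boundary would be chosen by repeating this comparison over candidate cut points, which this sketch does not attempt.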

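    Chinese Abstract
    Table of Contents
    List of Figures
    List of Tables
    Chapter 1 Introduction
      1.1 Research Background and Motivation
      1.2 Research Objectives
      1.3 Research Process
    Chapter 2 Literature Review
      2.1 Data Mining
        2.1.1 Supervised Learning
        2.1.2 Supervised Learning: Improving Prediction
      2.2 Bank Credit Risk
        2.2.1 Traditional Bank Credit Risk Assessment
        2.2.2 Assessing Credit Risk with Machine Learning Methods
      2.3 Handling Imbalanced Data
        2.3.1 The Problem of Imbalanced Data
        2.3.2 Approaches to Imbalanced Data: Oversampling and Undersampling
      2.4 Improving Prediction through Segmented Data
    Chapter 3 Research Methods
      3.1 Research Framework
      3.2 Imbalanced Data Handling: Undersampling
      3.3 Cross-Validation: K-fold Cross Validation
      3.4 Simple Algorithms
        3.4.1 Logistic Regression
        3.4.2 Decision Tree (CART)
        3.4.3 K-Nearest Neighbors Algorithm
      3.5 Ensemble Learning
        3.5.1 Random Forest
      3.6 Evaluating Model Predictive Performance
        3.6.1 Confusion Matrix
        3.6.2 Receiver Operating Characteristic Curve
      3.7 Analysis Tools
    Chapter 4 Empirical Results
      4.1 Data Description
        4.1.1 Problem Discovery: Exploratory Data Analysis
        4.1.2 Undersampling
      4.2 Data Preprocessing
        4.2.1 Variable Transformation
        4.2.2 New Variables: Interaction Features
      4.3 Evaluation of Model Predictive Ability
    Chapter 5 Conclusion
      5.1 Research Results
      5.2 Research Contributions
      5.3 Research Limitations
    References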

