Graduate student: 鄧莉雅
Teng, Li-Ya
Thesis title: 決策樹分析與羅吉斯迴歸於資料探勘的整合運用:以人事資料與民眾健康影響因素之探討為例
Integration of Decision Tree and Logistic Regression in Data Mining: Examples of Analysis of Personnel Data and the Influence Factors on People’s Health
Advisor: 邱皓政
Chiou, Haw-Jeng
Degree: Master's
Department: 全球經營與策略研究所
Graduate Institute of Global Business and Strategy
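Year of publication: 2015
Academic year of graduation: 103
Language: Chinese
Number of pages: 116
Chinese keywords: 資料探勘, 決策樹, 羅吉斯迴歸, 變數篩選
English keywords: Data mining, Logistic regression, Decision tree, Variable selection
Thesis type: Academic thesis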
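  • Data are an important asset of enterprises and organizations, and how to analyze and mine data effectively is an important issue in improving operational performance. When data mining methods are applied to uncover and filter good information from data, classification is an important task, and decision tree analysis is the most frequently used classification technique in data mining. However, the more variables are entered, the more the execution performance of decision tree analysis is affected.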

    Data are among the most important assets of an enterprise or organization, and analyzing and mining data effectively is a key issue in improving operational performance. When data mining is used to extract useful information, classification is a central task, and decision tree analysis is the most commonly used classification technique. However, as more variables are entered, the performance of decision tree analysis can suffer. To address this weakness, this study integrates logistic regression into the analysis to improve classification performance. Using the significance tests from logistic regression, variables with strong explanatory power are selected and entered into the decision tree model, which can improve the efficiency of the analysis and yield rules of practical value. This thesis therefore takes decision tree analysis, a technique commonly applied in data mining, and integrates logistic regression with it to examine how variable selection affects classification performance in two databases. Logistic regression is first used to analyze the data and determine which variables to retain: variables with larger Wald statistics and higher significance are entered into the decision tree, and the resulting model is compared with a model built without variable selection to see whether the new rules are fewer or more efficient.
    For the empirical analysis, decision tree analysis and logistic regression were applied to two data sources: a personnel database and the Panel Study of Family Dynamics (PSFD). Because the personnel database contains strong salary variables, the two-step analysis was run on models with and without these salary variables, and the outcomes were compared. Because the PSFD links data across multiple years, the analysis focused on the factors influencing people’s health and compared several of the yearly datasets.
    In the personnel database, the important selected variables are beginning salary, salary, education, and previous experience. When the strong variables are included, the decision tree results do not change whether or not variable selection by logistic regression is applied, although classification accuracy can rise after the insignificant variables are deleted. When the strong variables are excluded, by contrast, the decision tree results change noticeably. In the PSFD, the primary important variables are marital health and the health of the father and mother. Integrating logistic regression reduces the number of rules produced by C5.0. However, because fewer variables are entered, none of the model evaluation rates improve.
    This study introduces the concepts and applications of logistic regression and decision tree analysis, proposes a two-step analysis strategy, and applies it to two practical databases, illustrating specifically that significance testing of variables from multivariate statistics can improve the effectiveness of data mining techniques. Finally, the limitations of this research, directions for future study, and suggestions for application are discussed.
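
    The first step of the two-step strategy, screening variables by the Wald test of a logistic regression, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's actual procedure: the data are synthetic, the variable names `signal` and `noise` and the 0.05 threshold are hypothetical, and the second step (entering the retained variables into a C5.0 decision tree) is not shown.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def invert(a):
    # Gauss-Jordan inversion with partial pivoting (adequate for a few predictors).
    n = len(a)
    m = [row[:] + [1.0 if i == j else 0.0 for j in range(n)] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        d = m[c][c]
        m[c] = [v / d for v in m[c]]
        for r in range(n):
            if r != c:
                f = m[r][c]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[c])]
    return [row[n:] for row in m]

def wald_tests(X, y, names, iters=25):
    """Fit a logistic regression by Newton-Raphson and return
    {name: (coefficient, Wald chi-square, p-value)} for each predictor."""
    rows = [[1.0] + list(x) for x in X]          # prepend an intercept column
    p = len(rows[0])
    beta = [0.0] * p
    for _ in range(iters):
        grad = [0.0] * p
        info = [[0.0] * p for _ in range(p)]     # observed information matrix
        for xi, yi in zip(rows, y):
            mu = sigmoid(sum(b * v for b, v in zip(beta, xi)))
            for j in range(p):
                grad[j] += (yi - mu) * xi[j]
                for k in range(p):
                    info[j][k] += mu * (1.0 - mu) * xi[j] * xi[k]
        # cov is evaluated at the previous iterate; once the fit has
        # converged the difference from the final iterate is negligible.
        cov = invert(info)
        beta = [b + sum(cov[j][k] * grad[k] for k in range(p)) for j, b in enumerate(beta)]
    out = {}
    for j, name in enumerate(names, start=1):    # index 0 is the intercept
        w = beta[j] ** 2 / cov[j][j]             # Wald chi-square, 1 degree of freedom
        out[name] = (beta[j], w, math.erfc(math.sqrt(w / 2.0)))
    return out

# Synthetic data (an assumption for illustration): only "signal" drives the outcome.
random.seed(0)
names = ["signal", "noise"]
X = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(400)]
y = [1 if random.random() < sigmoid(2.0 * x[0]) else 0 for x in X]

results = wald_tests(X, y, names)
ALPHA = 0.05                                     # hypothetical significance threshold
selected = [n for n in names if results[n][2] < ALPHA]
print(selected)
```

    The names in `selected` are the columns that would then be passed to the decision tree learner, and the resulting rule set compared with one built from all variables.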

    Table of Contents
    Abstract (Chinese)
    Abstract
    Table of Contents
    List of Tables
    List of Figures
    Appendices
    Chapter 1 Introduction
        Section 1 Research Background
        Section 2 Research Objectives
    Chapter 2 Literature Review
        Section 1 Data Mining
        Section 2 Decision Trees
        Section 3 Logistic Regression
        Section 4 Related Literature on Applications of Decision Trees and Logistic Regression
        Section 5 Literature Review of the Empirical Databases
    Chapter 3 Research Methods
        Section 1 Data Sources
        Section 2 Analysis Methods
    Chapter 4 Results and Discussion
        Section 1 Empirical Analysis of the Personnel Database
        Section 2 Analysis of Factors Influencing People's Health
    Chapter 5 Conclusions and Recommendations
        Section 1 Main Research Findings
        Section 2 Practical Implications
        Section 3 Research Conclusions
        Section 4 Research Limitations and Recommendations
    References

    List of Tables
    Table 2-1: Confusion matrix
    Table 2-2: Comparison of decision tree algorithms
    Table 2-3: Prediction results for a binary categorical dependent variable
    Table 2-4: Literature on applications of decision trees
    Table 2-5: Accuracy evaluation of decision trees and other models
    Table 3-1: Descriptive statistics of the personnel database
    Table 3-2: Descriptive statistics of the 2010 Panel Study of Family Dynamics
    Table 3-3: Research variables from the Panel Study of Family Dynamics
    Table 3-4: Confusion matrix used in this study
    Table 4-1: Parameter estimates for the personnel database, with "salary data"
    Table 4-2: Parameter estimates for the personnel database, without "salary data"
    Table 4-3: Decision tree rule set including salary variables, before variable selection
    Table 4-4: Decision tree rule set including salary variables, after variable selection
    Table 4-5: Decision tree rule set excluding salary variables, before variable selection
    Table 4-6: Decision tree rule set excluding salary variables, after variable selection
    Table 4-7: Comparison of C5.0 decision tree results for the personnel database before and after variable selection
    Table 4-8: Model evaluation for the personnel database
    Table 4-9: Parameter estimates for the PSFD, 2010 RR health status
    Table 4-10: Parameter estimates for the PSFD, 2008 RR health status
    Table 4-11: Parameter estimates for the PSFD, 2006 RR health status
    Table 4-12: Parameter estimates for the PSFD, 2002 RIII & RIV health status
    Table 4-13: Parameter estimates for the PSFD, 2000 RII health status
    Table 4-14: Parameter estimates for the PSFD, 2000 RI health status
    Table 4-15: Summary of logistic regression results for the PSFD
    Table 4-16: Decision tree rule sets for the PSFD
    Table 4-17: Comparison of C5.0 decision tree results for the PSFD before and after variable selection
    Table 4-18: Model evaluation for the PSFD

    List of Figures
    Figure 2-1: Decision tree
    Figure 3-1: Research flowchart
    Figure 3-2: C5.0 decision tree model for the personnel database
    Figure 3-3: Adjusted C5.0 decision tree model for the personnel database
    Figure 3-4: C5.0 decision tree model for the PSFD
    Figure 4-1: C5.0 decision tree model analysis, with "salary data"
    Figure 4-2: C5.0 decision tree model analysis after variable selection, with "salary data"
    Figure 4-3: C5.0 decision tree model analysis, without "salary data"
    Figure 4-4: C5.0 decision tree model analysis after variable selection, without "salary data"

    Appendices
    Appendix 1: Logistic regression analysis of the PSFD
    Appendix 2: Decision tree rule sets for the PSFD

