
Graduate student: 鄧莉雅 (Teng, Li-Ya)
Thesis title: 決策樹分析與羅吉斯迴歸於資料探勘的整合運用:以人事資料與民眾健康影響因素之探討為例 (Integration of Decision Tree and Logistic Regression in Data Mining: Examples of Analysis of Personnel Data and the Influence Factors on People's Health)
Advisor: 邱皓政 (Chiou, Haw-Jeng)
Degree: Master
Department: 全球經營與策略研究所 (Graduate Institute of Global Business and Strategy)
Year of publication: 2015
Academic year of graduation: 103 (ROC calendar)
Language: Chinese
Number of pages: 116
Keywords (Chinese): 資料探勘, 決策樹, 羅吉斯迴歸, 變數篩選
Keywords (English): Data mining, Logistic regression, Decision tree, Variable selection
Thesis type: Academic thesis
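  • Data is an important asset for enterprises and organizations, and analyzing and mining it effectively is key to improving operational performance. When data mining methods are used to extract and filter useful information from data, classification is a central task, and decision tree analysis is the most commonly used classification technique in data mining. However, the more input variables are included, the more the performance of decision tree analysis suffers.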
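    For the empirical analysis, this study uses a personnel database and the Panel Study of Family Dynamics (PSFD) database to carry out an integrated analysis with decision trees and logistic regression. The personnel database contains strong salary variables, so the study compares the performance and effects of the two-stage analysis with and without these strong variables. The PSFD is a multi-year fixed-panel follow-up survey, which makes it possible to analyze and compare the factors influencing people's health across several survey years.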
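    The results show that, in the personnel database, the important input variables for the three-level job category variable are starting salary, current salary, education level, and previous experience. When the input variables include the strong variables, the decision tree results are the same before and after the logistic regression variable selection procedure, although classification accuracy improves once the non-significant variables are removed; when the input variables do not include the strong variables, the decision tree results change markedly. In the analysis of factors influencing people's health in the PSFD, the main input variables for classifying health status into three levels are the health of the spouse, of the father, and of the mother. The two-stage integrated procedure substantially reduces the number of decision rules in the subsequent C5.0 decision tree analysis and makes the rules easier to interpret, but because many input variables are dropped, classification accuracy and the other related indicators do not improve significantly. In addition to explaining the principles and applications of logistic regression and decision tree analysis, this study proposes a two-stage integrated analysis strategy and applies it to two empirical databases, showing concretely that data mining techniques can be combined with the variable-importance testing strategies of multivariate statistics to improve analytical performance. It closes by discussing the limitations of the study and suggestions for future research and application.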

    Data is one of the most important assets of an enterprise or organization, and using data analysis and data mining efficiently to improve operational effectiveness is an important issue. When data mining is applied to extract or select useful information, classification is a central task, and decision tree analysis is the data mining technique most commonly used for it. However, as more variables are entered, the effectiveness of the analysis may suffer. To address this weakness, this study integrates logistic regression into the analysis to improve classification effectiveness. Using the significance tests of logistic regression to select the variables with strong explanatory power for the decision tree model can improve the effectiveness of the analysis and yield rules of practical value. This study therefore takes decision tree analysis, which is widely used in data mining, and integrates logistic regression with it to examine how variable selection and classification effectiveness behave in two databases. Logistic regression is first used to analyze the data and determine which variables to use; the variables with higher Wald statistics and greater significance are entered into the decision tree, and the result is compared with the model built without variable selection to see whether the new rules are fewer or more efficient.
    For the empirical analysis, the data come from a personnel database and the Panel Study of Family Dynamics (PSFD), which are analyzed with decision trees and logistic regression. Because the personnel database contains strong salary variables, the two-step analysis is run with and without them and the outcomes are compared. Because the PSFD is a multi-year panel, the analysis focuses on the factors influencing people's health and compares several of its yearly datasets.
    The results show that, in the personnel database, the important selected variables are starting salary, current salary, education, and previous experience. When the input variables include the strong variables, the decision tree outcome does not change whether or not logistic regression variable selection is applied, although classification accuracy rises after the insignificant variables are deleted. When the strong variables are not included, the decision tree results change noticeably. In the PSFD, the primary variables are the health of the spouse, the father, and the mother. Integrating logistic regression reduces the number of rules in the C5.0 analysis; however, because fewer variables are entered, none of the model evaluation measures improves.
    This study introduces the concepts and applications of logistic regression and decision tree analysis, proposes a two-step analysis strategy, and applies it to two empirical databases, illustrating specifically that data mining techniques can improve analytical effectiveness when combined with the variable significance tests of multivariate statistics. Finally, it discusses the limitations of this research and offers suggestions for future study and application.
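    The two-step strategy described in the abstract can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the data are synthetic, the outcome is binary (the thesis classifies three levels), a plain entropy-based tree stands in for C5.0, and the helper names (`fit_logistic`, `build_tree`, `n_leaves`) and the |z| > 1.96 cutoff for the Wald test are choices made only for this example.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    return 1 / (1 + math.exp(-z)) if z >= 0 else math.exp(z) / (1 + math.exp(z))

def solve(A, b):
    # Solve A x = b by Gauss-Jordan elimination with partial pivoting.
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def fit_logistic(X, y, iters=25):
    # Step 1: Newton-Raphson fit of a binary logistic regression.
    # Returns coefficients and Wald z statistics (beta / standard error).
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(iters):
        grad = [0.0] * k
        H = [[0.0] * k for _ in range(k)]  # information matrix X'WX
        for xi, yi in zip(X, y):
            p = sigmoid(sum(b * x for b, x in zip(beta, xi)))
            for a in range(k):
                grad[a] += (yi - p) * xi[a]
                for c in range(k):
                    H[a][c] += p * (1 - p) * xi[a] * xi[c]
        beta = [b + s for b, s in zip(beta, solve(H, grad))]
    # Standard errors from the diagonal of the inverse information matrix
    # (H comes from the last iteration, by which point the fit has settled).
    se = [math.sqrt(solve(H, [float(i == j) for i in range(k)])[j]) for j in range(k)]
    return beta, [b / s for b, s in zip(beta, se)]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in (labels.count(v) for v in set(labels)))

def build_tree(rows, labels, features, depth=3):
    # Step 2: greedy binary tree minimizing weighted entropy (stand-in for C5.0).
    majority = max(set(labels), key=labels.count)
    if depth == 0 or len(set(labels)) == 1:
        return majority
    best = None
    for f in features:
        values = sorted({r[f] for r in rows})
        for t in values[::max(1, len(values) // 20)]:  # coarse threshold grid
            left = [l for r, l in zip(rows, labels) if r[f] <= t]
            right = [l for r, l in zip(rows, labels) if r[f] > t]
            if left and right:
                score = len(left) * entropy(left) + len(right) * entropy(right)
                if best is None or score < best[0]:
                    best = (score, f, t)
    if best is None:
        return majority
    _, f, t = best
    lo = [(r, l) for r, l in zip(rows, labels) if r[f] <= t]
    hi = [(r, l) for r, l in zip(rows, labels) if r[f] > t]
    return (f, t,
            build_tree([r for r, _ in lo], [l for _, l in lo], features, depth - 1),
            build_tree([r for r, _ in hi], [l for _, l in hi], features, depth - 1))

def predict(tree, row):
    while isinstance(tree, tuple):
        f, t, lo, hi = tree
        tree = lo if row[f] <= t else hi
    return tree

def n_leaves(tree):
    # Each leaf corresponds to one decision rule.
    return n_leaves(tree[2]) + n_leaves(tree[3]) if isinstance(tree, tuple) else 1

def accuracy(tree, rows, labels):
    return sum(predict(tree, r) == l for r, l in zip(rows, labels)) / len(labels)

# Synthetic data: two informative variables and two pure-noise variables.
random.seed(0)
names = ["x1", "x2", "noise1", "noise2"]  # hypothetical variable names
rows, y = [], []
for _ in range(500):
    r = [random.gauss(0, 1) for _ in names]
    rows.append(r)
    y.append(int(random.random() < sigmoid(2.0 * r[0] - 1.5 * r[1])))
train, test = slice(0, 350), slice(350, 500)

# Step 1: keep variables whose Wald statistic is significant at the 5% level.
beta, z = fit_logistic([[1.0] + r for r in rows[train]], y[train])
selected = [j for j in range(len(names)) if abs(z[j + 1]) > 1.96]

# Step 2: compare trees built with all variables and with the selected ones.
tree_all = build_tree(rows[train], y[train], range(len(names)))
tree_sel = build_tree(rows[train], y[train], selected)
for label, tree in (("all variables", tree_all), ("selected only", tree_sel)):
    print(label, "rules:", n_leaves(tree),
          "test accuracy:", round(accuracy(tree, rows[test], y[test]), 3))
print("selected:", [names[j] for j in selected])
```

    Comparing the number of leaves and the held-out accuracy of the two trees mirrors the before-and-after comparison reported in the abstract; with a three-level outcome, a multinomial logistic regression would take the place of the binary fit used here.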

    Table of Contents
    Abstract (Chinese)
    Abstract (English)
    Table of Contents
    List of Tables
    List of Figures
    Appendices
    Chapter 1: Introduction
        Section 1: Research Background
        Section 2: Research Objectives
    Chapter 2: Literature Review
        Section 1: Data Mining
        Section 2: Decision Trees
        Section 3: Logistic Regression
        Section 4: Applications of Decision Trees and Logistic Regression in the Literature
        Section 5: Review of Literature on the Empirical Databases
    Chapter 3: Research Methods
        Section 1: Data Sources
        Section 2: Analysis Methods
    Chapter 4: Results and Discussion
        Section 1: Empirical Analysis of the Personnel Database
        Section 2: Analysis of Factors Influencing People's Health
    Chapter 5: Conclusions and Recommendations
        Section 1: Main Findings
        Section 2: Practical Implications
        Section 3: Conclusions
        Section 4: Limitations and Suggestions
    References

    List of Tables
    Table 2-1: Confusion matrix
    Table 2-2: Comparison of decision tree algorithms
    Table 2-3: Prediction results for a binary dependent variable
    Table 2-4: Applications of decision trees in the literature
    Table 2-5: Accuracy evaluation of decision trees and other models
    Table 3-1: Descriptive statistics of the personnel database
    Table 3-2: Descriptive statistics of the 2010 PSFD
    Table 3-3: Research variables from the PSFD
    Table 3-4: Confusion matrix used in this study
    Table 4-1: Parameter estimates for the personnel database, with salary data
    Table 4-2: Parameter estimates for the personnel database, without salary data
    Table 4-3: Decision tree rule set before variable selection, salary variables included
    Table 4-4: Decision tree rule set after variable selection, salary variables included
    Table 4-5: Decision tree rule set before variable selection, salary variables excluded
    Table 4-6: Decision tree rule set after variable selection, salary variables excluded
    Table 4-7: Comparison of C5.0 decision tree results before and after variable selection, personnel database
    Table 4-8: Model evaluation for the personnel database
    Table 4-9: Parameter estimates for the PSFD, health status, 2010 RR
    Table 4-10: Parameter estimates for the PSFD, health status, 2008 RR
    Table 4-11: Parameter estimates for the PSFD, health status, 2006 RR
    Table 4-12: Parameter estimates for the PSFD, health status, 2002 RIII & RIV
    Table 4-13: Parameter estimates for the PSFD, health status, 2000 RII
    Table 4-14: Parameter estimates for the PSFD, health status, 2000 RI
    Table 4-15: Summary of logistic regression results for the PSFD
    Table 4-16: Decision tree rule sets for the PSFD
    Table 4-17: Comparison of C5.0 decision tree results before and after variable selection, PSFD
    Table 4-18: Model evaluation for the PSFD

    List of Figures
    Figure 2-1: Decision tree
    Figure 3-1: Research flowchart
    Figure 3-2: C5.0 decision tree model for the personnel database
    Figure 3-3: Adjusted C5.0 decision tree model for the personnel database
    Figure 3-4: C5.0 decision tree model for the PSFD
    Figure 4-1: C5.0 decision tree model analysis, with salary data
    Figure 4-2: C5.0 decision tree model analysis after variable selection, with salary data
    Figure 4-3: C5.0 decision tree model analysis, without salary data
    Figure 4-4: C5.0 decision tree model analysis after variable selection, without salary data

    Appendices
    Appendix 1: Logistic regression analysis of the PSFD
    Appendix 2: Decision tree rule sets for the PSFD

