
Graduate student: 蘇柏豪 (Sue, Bo-Hao)
Thesis title: Predictions of HOMO, LUMO, and Energy Gap of Organic Molecules Based on Machine Learning Methods (基於機器學習預測有機分子之最高佔據分子軌域與最低未佔據分子軌域及其能隙)
Advisor: 蔡明剛 (Tsai, Ming-Kang)
Oral defense committee: 蔡明剛 (Tsai, Ming-Kang); 葉丞豪 (Yeh, Chen-Hao); 張鈞智 (Chang, Chun-Chih)
Oral defense date: 2023/07/14
Degree: Master
Department: Department of Chemistry
Publication year: 2023
Academic year of graduation: 111
Language: Chinese
Number of pages: 103
Chinese keywords: 機器學習 (machine learning), QM9資料集 (QM9 dataset), 聚類分群法 (clustering), 隨機森林 (random forest)
English keywords: machine learning, Quantum-Machine 9, K-means, random forest
Research method: Thematic analysis
DOI URL: http://doi.org/10.6345/NTNU202301435
Thesis type: Academic thesis
    Science and technology have advanced rapidly in recent years, and computer simulation research based on big data has grown with them. Using machine learning algorithms to predict results accurately, support experimental work, and uncover new possibilities has become a trend. Traditional quantum chemical calculations, by contrast, are time-consuming, relatively expensive, and feasible for only a small number of molecules.
    The HOMO, LUMO, and energy gap are widely used in organic chemistry because they relate to emission wavelength, electron transfer, and chemical reactivity. To address the limitations above, this study builds models using clustering together with linear and nonlinear regression, and analyzes a large and varied set of organic compounds step by step.
    This study uses Lasso regression, K-means clustering, and the random forest algorithm to predict the HOMO, LUMO, and energy gap of 114,896 organic molecules. The MAE between the theoretical and predicted values of the HOMO, LUMO, and energy gap is less than 0.3 eV, and the adjusted R² of the nonlinear regression model is greater than 0.93, showing that the model predictions agree closely with the expected chemical properties.
    The analysis shows that the models not only predict well but also select descriptors that are consistent with general chemical understanding. The concepts and analysis methods of this study may contribute to numerical analysis in related fields in the future.
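
The abstract reports model quality as a mean absolute error (MAE) in eV and an adjusted R² value. As a rough illustration of how these two metrics are commonly computed, here is a minimal sketch using only the Python standard library. The energy values and the feature count are hypothetical and are not taken from the thesis. The adjusted R² uses the usual formula 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of samples and p the number of descriptors; the record does not say which exact definition the thesis used.

```python
from statistics import mean


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between reference and predicted values."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(y_true, y_pred, n_features):
    """R² penalized for the number of descriptors used by the model."""
    n = len(y_true)
    r2 = r_squared(y_true, y_pred)
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)


# Hypothetical HOMO energies in eV (illustrative numbers, not thesis data).
reference = [-6.0, -5.5, -7.0, -6.5, -5.0, -6.2]
predicted = [-5.9, -5.7, -6.8, -6.6, -5.2, -6.1]

mae = mean_absolute_error(reference, predicted)
adj_r2 = adjusted_r_squared(reference, predicted, n_features=2)
print(f"MAE = {mae:.3f} eV, adjusted R2 = {adj_r2:.3f}")
```

Because the adjustment factor (n - 1)/(n - p - 1) is greater than 1 whenever p > 0, the adjusted value never exceeds the plain R², which makes it a more conservative figure when many descriptors are involved.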

    Acknowledgments
    Chinese Abstract
    Abstract
    Table of Contents
    List of Tables
    List of Figures
    Chapter 1 Introduction
        Section 1 Foreword
        Section 2 Research Motivation and Objectives
        Section 3 Literature Review
    Chapter 2 Data Background and Computational Methods
        Section 1 Dataset (Quantum-Machine 9)
        Section 2 Descriptors
        Section 3 Descriptor Selection and Modeling
            I. Variance Threshold
            II. K-means Clustering
            III. Silhouette Coefficient
            IV. Lasso
            V. Multiple Linear Regression
            VI. Random Forest
            VII. Cross Validation
        Section 4 Experimental Workflow
    Chapter 3 Results and Discussion
        Section 1 Model Analysis of Variance Threshold Screening
        Section 2 Analysis of Clustering Results
        Section 3 Model Selection and Analysis with Lasso
        Section 4 Model Analysis of the Random Forest Algorithm
        Section 5 Test Data for Linear and Nonlinear Models
    Chapter 4 Conclusion
        Conclusion
    References
    Appendix
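
The outline lists a variance threshold filter and a silhouette coefficient among the descriptor selection steps. The sketch below shows, under stated assumptions, what these two steps generally do. It is not the thesis's code: the descriptor names (`desc_a`, `desc_b`, `desc_const`), the data, and the two cluster labelings are hypothetical, and only the Python standard library is used. The silhouette function assumes at least two clusters and follows the usual definition s = (b - a) / max(a, b), with s = 0 for a cluster containing a single point.

```python
from math import dist
from statistics import mean, pvariance


def variance_threshold(columns, threshold=0.0):
    """Keep descriptor columns whose population variance exceeds the threshold."""
    return {name: vals for name, vals in columns.items()
            if pvariance(vals) > threshold}


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (needs at least two clusters)."""
    scores = []
    for i, p in enumerate(points):
        # a: mean distance to the other members of the point's own cluster
        own = [dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:  # single-point cluster: silhouette is defined as 0
            scores.append(0.0)
            continue
        a = mean(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            mean(dist(p, q) for j, q in enumerate(points) if labels[j] == other)
            for other in set(labels) if other != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return mean(scores)


# Hypothetical descriptor table: one list of values per descriptor.
columns = {
    "desc_a": [1.0, 1.2, 0.9, 5.0, 5.1, 4.9],
    "desc_b": [2.0, 2.1, 1.9, 8.0, 8.2, 7.9],
    "desc_const": [1.0] * 6,  # zero variance, so it carries no information
}
kept = variance_threshold(columns)
points = list(zip(kept["desc_a"], kept["desc_b"]))

good_labels = [0, 0, 0, 1, 1, 1]  # matches the two visible groups
bad_labels = [0, 1, 0, 1, 0, 1]   # mixes the groups
good = silhouette_score(points, good_labels)
bad = silhouette_score(points, bad_labels)
print(sorted(kept), round(good, 3), round(bad, 3))
```

A labeling that matches the natural grouping of the data scores close to 1, while a labeling that mixes the groups scores much lower. This is why the silhouette coefficient is often used to compare candidate numbers of clusters for K-means.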

