Graduate student: | 張祐誠 Chang, Yu-Cheng |
---|---|
Thesis title: | 使用梯度提昇機辨認暗網市場之毒品高衝擊賣家 Identifying High Impact Drug Sellers in Dark Net Marketplaces Using Gradient Boosting Machine |
Advisor: | 侯文娟 |
Degree: | Master |
Department: | Department of Computer Science and Information Engineering |
Year of publication: | 2020 |
Academic year of graduation: | 108 |
Language: | Chinese |
Number of pages: | 61 |
Chinese keywords: | 暗網 (darknet), 暗網市場 (darknet marketplace), 購物網站 (shopping website), 藥物 (drugs), 梯度提昇機 (gradient boosting machine) |
English keywords: | darknet, dnm, marketplace, gbm, gradient boosting machine, XGBoost |
DOI URL: | http://doi.org/10.6345/NTNU202001470 |
Thesis type: | Academic thesis |
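This study assembles various software into a crawler targeting darknet marketplaces (DarkNet Marketplaces, hereafter DNM). It bypasses mechanisms such as identity authentication, cookie expiration, and crawler rejection (robots.txt), programmatically retrieves the HTML files the researcher needs, and hands them to the Jsoup library to parse the required page fields. The fields are converted to JSON and stored in a database on the local machine, with Elasticsearch (hereafter ES) accessed through the built-in cURL command, which simplifies later maintenance, backup, and automated training.
The collected data are then used to train a decision-tree machine learning model with a gradient boosting machine (GBM), which extracts features from the data, identifies the factors behind high impact, and attempts to predict the likely future impact of new posts in each DNM, producing an impact ranking of the sellers in that DNM. This study attempts to use the biological half-life of the drug itself as the basis for impact: it builds a program that obtains the drug's half-life from trusted websites and converts the half-life into a degree of addictiveness, which serves as the quantified impact of the drug on society. Earlier researchers quantified impact for opioids, using each drug's equivalent dose relative to morphine (potency) as the impact reference. This study, however, addresses a broader definition of drugs and therefore chooses biological half-life as the quantified impact.
Starting from onion live, this study selects five DNMs of differing character and attempts to build an unrestricted crawler architecture so that later researchers can obtain data easily. XGBoost is used to train a GBM model for each DNM separately: 90% of each DNM's data is randomly selected as training data and the other 10% as test data. Precision, recall, and F1 score are computed, and the F1 score reaches up to 95%.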
This research assembles various software components into a crawler for darknet marketplaces (DNMs). The crawler bypasses authentication, cookie expiration, and crawler-rejection (robots.txt) mechanisms to retrieve the HTML files the research needs. The Jsoup library then parses the required fields from these files and converts them to JSON. The data are stored in a local Elasticsearch (ES) database through curl commands, which facilitates subsequent maintenance, backup, and automation of training.
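As an illustration of the storage step, the following minimal sketch builds a request body for the Elasticsearch `_bulk` endpoint from already-parsed listing fields. It is not the thesis implementation, which used Jsoup (Java) and curl. The index name `dnm_listings`, the field names, and the sample records are hypothetical and chosen only for illustration.

```python
import json


def to_bulk_payload(listings, index_name="dnm_listings"):
    """Build an Elasticsearch _bulk request body (NDJSON) from parsed listings.

    `index_name` and the `listing_id` field are hypothetical names,
    not taken from the thesis.
    """
    lines = []
    for doc in listings:
        # Action line: index this document under a caller-chosen ID.
        action = {"index": {"_index": index_name, "_id": doc["listing_id"]}}
        lines.append(json.dumps(action))
        # Source line: the parsed fields themselves.
        lines.append(json.dumps(doc, ensure_ascii=False))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical records standing in for fields parsed out of the HTML pages.
listings = [
    {"listing_id": "m1-0001", "market": "market-1", "seller": "seller_a",
     "title": "example listing A", "price_usd": 25.0},
    {"listing_id": "m1-0002", "market": "market-1", "seller": "seller_b",
     "title": "example listing B", "price_usd": 40.0},
]

payload = to_bulk_payload(listings)
print(payload)
```

If the payload is saved to a file such as `listings.ndjson`, it could be sent with a command along the lines of `curl -H "Content-Type: application/x-ndjson" -XPOST "localhost:9200/_bulk" --data-binary @listings.ndjson`, assuming ES listens on its default local port.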
The collected data are then used to train a decision-tree-based gradient boosting machine (GBM) model. The model extracts features from the data, identifies the factors behind high impact, and tries to predict the likely future impact of new posts in each DNM. Finally, the sellers in each DNM are ranked by impact value.
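The ranking step can be sketched as follows. The abstract does not say how per-post impact values are combined per seller, so this example assumes a simple sum; the function name, seller names, and scores are hypothetical.

```python
from collections import defaultdict


def rank_sellers(predictions):
    """Rank sellers by total predicted impact, highest first.

    `predictions` is an iterable of (seller, predicted_impact) pairs.
    Summing per seller is an assumption made for this sketch.
    """
    totals = defaultdict(float)
    for seller, impact in predictions:
        totals[seller] += impact
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical per-post predictions within one DNM.
predicted = [
    ("seller_a", 1.0),
    ("seller_b", 0.75),
    ("seller_a", 0.5),
    ("seller_c", 1.25),
    ("seller_b", 0.25),
]

ranking = rank_sellers(predicted)
for seller, total in ranking:
    print(seller, total)
```

Other aggregations, such as the mean or the maximum impact per seller, would slot into the same structure by changing how `totals` is accumulated.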
This research attempts to use the biological half-life of a drug as the basis for its impact. A program obtains each drug's half-life from trusted websites and converts it into a level of addictiveness, which serves as the quantified impact of the drug on society.
Previous researchers quantified the impact of opioids, using the equivalent dose (potency) of each drug relative to morphine as the impact reference. Because this study explores a broader definition of drugs, it uses biological half-life as the quantified impact instead.
Starting from onion live, this research selects five DNMs with different characteristics and tries to build an unrestricted crawler architecture that makes it easier for subsequent researchers to obtain data.
This research uses XGBoost to train a separate GBM model for each DNM, randomly taking 90% of each DNM's data for training and the remaining 10% for testing. The evaluation metrics are precision, recall, and F1 score; an F1 score of 95% was achieved.
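The evaluation protocol can be illustrated with a short sketch that performs a random 90/10 split and computes precision, recall, and F1. It assumes a binary "high impact" label (1) versus everything else (0), which the abstract does not state. Model training with XGBoost is omitted, and the labels below are made-up values, not thesis results.

```python
import random


def train_test_split(rows, test_fraction=0.1, seed=0):
    """Randomly hold out `test_fraction` of the rows as test data."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# 100 placeholder row IDs split 90/10, as described in the abstract.
train_rows, test_rows = train_test_split(range(100))

# Made-up labels and predictions for illustration only.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(len(train_rows), len(test_rows), precision, recall, f1)
```

In this toy case there are four true positives, one false positive, and one false negative, so precision and recall both come out to 4/5.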